The military services have a vital concern in assuring that aptitude test
scores  are  appropriate  measures  of  examinees'  true  abilities.  Substantial 
bonuses  have  been  paid  to  examinees  with  sufficiently  high  scores  as 
enticement  to  enlist  into  selected  occupations.  Under  mobilization,  exemption 
from  service  will  be  given  to  examinees  with  unacceptably  low  scores. 
Therefore, cheating to improve scores and deliberately picking incorrect
answers  to  lower  scores  are  both  plausible  threats  to  the  integrity  of 
enlistment  testing.  The  goal  of  Appropriateness  Measurement  is  to  develop 
ways  to  analyze  examinees'  responses  to  multiple-choice  tests  so  as  to 
identify  such  inappropriate  test  responding. 

This  effort  evaluates  11  practical  appropriateness  indices.  Three,  which 
are  based  on  modern  test  theory  (Item  Response  Theory),  were  found  to 
effectively  detect  aberrant  response  patterns  across  a  fairly  wide  range  of 
conditions.  This  success  was  obtained  when  the  test  had  many  items  but  was 
substantially  lessened  for  military  selection  test  lengths.  However,  methods 
developed  for  combining  information  on  aberrant  responding  across  several 
different  tests  resulted  in  an  effectiveness  comparable  to  that  found  with  the 
longer  tests. 

The  results  strongly  suggest  that  appropriateness  indices  can  be  used 
effectively  in  operational  settings.  Further  research  is  suggested  on  a  class 
of  indices  called  "optimal"  which  hold  the  promise  of  even  better 
identification of aberrant responding than those indices already identified.

PREFACE


This  effort  was  accomplished  under  Project  7719,  "Development  and 
Validation of Selection Methodologies." It represents the continuing effort
of  the  Air  Force  Human  Resources  Laboratory  to  fulfill  its  research  and 
development  (R&D)  responsibilities  through  development  and  application  of 
state-of-the-art  methodologies  for  the  continued  improvement  of  the  Armed 
Services Vocational Aptitude Battery (ASVAB).

We  wish  to  thank  Bruce  Williams  and  Gregory  L.  Candell  for  their  help  in 
conducting  the  research  described  in  Chapter  III.  They  will  be  coauthors  of 
the  paper  summarizing  this  research  when  it  is  submitted  for  journal 
publication. 

I. INTRODUCTION AND OVERVIEW


Some  examinees'  scores  on  a  multiple-choice  test  may  fail  to  provide 
valid  measures  of  the  trait  measured  by  the  test.  Examinees  can  obtain 
spuriously  high  scores  because  they  copy  answers  from  more  talented  neighbors 
or  because  they  have  been  given  the  answers  to  some  questions.  Examinees  can 
obtain  spuriously  low  scores  due  to  alignment  errors  (answering,  say,  the 
tenth  item  in  the  space  provided  for  the  ninth  item,  answering  the  eleventh 
item  in  the  space  provided  for  the  tenth  item,  etc.),  language  difficulties, 
atypical  educations,  and  unusually  creative  interpretations  of  normally  easy 
items. 

Detecting  inappropriate  test  scores  is  very  important  in  military 
testing.  For  example,  substantial  recruitment  bonuses  may  be  erroneously  paid 
to  low  ability  examinees  who  obtain  spuriously  high  test  scores.  Many  of 
these  individuals  are  likely  to  fail  to  complete  military  technical  training 
schools;  this  leads  to  high  attrition  costs.  Even  when  such  individuals  are 
able to complete training, they are likely to exhibit low on-the-job
performance.

Spuriously  low  scores  can  also  cause  serious  difficulties  in  military 
testing.  Spuriously  low  scores  can  lead  to  difficulties  in  filling  important 
manpower  needs  because  truly  able  individuals  will  be  inappropriately 
disqualified.  This  problem  is  likely  to  be  exacerbated  in  the  future  as  the 
birthrates  of  many  demographic  groups  decline. 

The  goal  of  Appropriateness  Measurement  is  to  identify  inappropriate  test 
scores.  In  recent  years,  several  methods  for  identifying  these  test  scores 
have  been  devised.  In  all  approaches,  response  patterns  are  characterized  in 
a  way  that  permits  us  to  assess  quantitatively  the  degree  to  which  an  observed 
response  vector  is  atypical.  This  quantitative  measure  is  then  used  to 
classify  response  patterns  into  appropriate  (i.e.,  normal)  and  inappropriate 
(i.e.,  aberrant)  categories. 

In  a  series  of  studies,  it  has  been  found  that  simulated  spuriously  high 
response  patterns  and  simulated  spuriously  low  response  patterns  can  be 
detected  by  appropriateness  measurement.  High  detection  rates  have  been 
obtained despite model misspecification, errors in item characteristic curve
parameter  estimates,  and  the  inclusion  of  inappropriate  response  patterns  in 
the  test  norming  sample  (Levine  &  Drasgow,  1982).  Very  high  detection  rates 
have  been  obtained  when  response  patterns  of  low  ability  examinees  have  been 
modified  to  simulate  cheating  and  when  response  patterns  of  high  ability 
examinees  have  been  modified  to  simulate  spuriously  low  responding  (Drasgow, 
Levine,  &  Williams,  1985). 

Among  the  many  methods  that  have  been  proposed,  which  is  best  for 
detecting  inappropriate  test  scores  on  the  short  unidimensional  power  subtests 
from  the  Armed  Services  Vocational  Aptitude  Battery  (ASVAB)?  Also,  is  there 
some  clearly  superior  method  that  has  not  yet  been  proposed?  Previous 
research on Appropriateness Measurement has generally focused on long
unidimensional  tests  such  as  the  Scholastic  Aptitude  Test-Verbal  section  (SAT- 
V)  and  the  Graduate  Record  Examination-Verbal  section  (GRE-V).  The  research 
described  in  this  paper  was  designed  to  determine  which  of  these  indices  is 
best  for  ASVAB  subtests  (in  particular,  the  portion  known  as  the  Armed  Forces 
Qualification  Test  or  AFQT)  and,  as  described  below,  to  decide  if  the  best 
method  currently  available  could  be  significantly  improved. 

The  difficult  problem  of  evaluating  the  effectiveness  of  an 
appropriateness  index  was  recently  solved  to  a  large  extent  by  Levine  and 
Drasgow  (1984;  1987).  They  developed  statistical  theory  and  numerical  methods 
that enabled them to compute optimal appropriateness indices for given forms
of  aberrance.  These  indices  are  optimal  in  the  sense  that  no  other  statistics 
computed  from  an  examinee's  item  responses  can  achieve  higher  rates  of 
detection  (at  each  error  level)  of  given  forms  of  aberrance.  Thus,  the 
absolute  effectiveness  of  any  practical,  easy-to-compute  appropriateness 
index  previously  suggested  in  the  literature  can  be  determined  by  comparing  it 
to  an  optimal  index. 

Many  appropriateness  indices  were  evaluated  in  the  present  effort.  The 
best  practical  appropriateness  indices  based  on  Item  Response  Theory  (IRT) 
were  found  to  be  far  superior  to  non-IRT  alternatives,  such  as  the 
standardized  residual  from  a  multiple  regression  equation.  In  some  cases,  the 
best  practical  indices  had  detection  rates  that  were  nearly  as  high  as  the 
detection  rates  of  optimal  appropriateness  indices.  In  other  situations, 
optimal  indices  provided  far  higher  detection  rates. 

At  present,  optimal  indices  show  promise  for  use  in  operational  settings. 
With  further  development,  optimal  indices  could  be  used  to  provide  powerful 
detection  of  specific  forms  of  aberrance  that  are  difficult  to  detect  using 
even  the  best  practical  indices.  For  example,  suppose  a  test  score  falls  into 
AFQT Category 3A. Does the examinee truly belong to this ability category?
Or is the examinee actually an AFQT Category 3B examinee who was unethically
given  the  answers  to  a  moderate  number  of  items?  An  optimal  index  can  be 
formulated  to  test  such  hypotheses. 

In the first study described in this report, 11 practical appropriateness
indices  were  evaluated  and  compared  to  optimal  indices.  Simulated  SAT-V  data 
were  used  in  the  first  study  because  many  of  the  practical  indices  were 
originally  proposed  in  the  context  of  a  long  unidimensional  test.  Optimal 
indices were found to provide very high rates of detection of inappropriate
response  patterns.  The  best  practical  indices  were  nearly  optimal  in  some 
conditions  but  fell  short  of  optimal  in  other  conditions. 

In  the  second  study  conducted  for  this  effort,  the  effectiveness  of  each 
of  the  practical  and  optimal  indices  on  a  short  unidimensional  test  was 
evaluated  using  simulated  ASVAB  Arithmetic  Reasoning  (AR)  subtest  data.  Rates 
of  detection  of  aberrant  response  patterns  were  found  to  be  substantially 
reduced  for  the  short  AR  subtest  in  relation  to  the  long  SAT-V  test. 

Methods  were  then  developed  for  combining  information  about  aberrance 
across  several  short  unidimensional  tests.  Simulated  and  actual  ASVAB  data 
for  the  AR,  Word  Knowledge  (WK),  and  Paragraph  Comprehension  (PC)  subtests 
were  used  to  evaluate  the  multi-test  appropriateness  indices.  By  increasing 
the number of items from 30 on the AR test to 80 on the combined AR, WK, and
PC  subtests,  we  obtained  detection  rates  that  were  comparable  to  the  85-item 
SAT-V. 

The  following  chapters  describe  the  present  research  and  development 
(R&D)  effort,  provide  concluding  remarks,  and  suggest  directions  for  future 
R&D.  The  results  strongly  suggest  that  appropriateness  indices  based  on  IRT 
can  be  used  effectively  in  operational  settings.  Further  significant  gains  in 
detection  rates  are  expected  if  optimal  indices  are  developed  for  use  in 
operational  settings. 


II. DETECTING INAPPROPRIATE TEST SCORES ON A LONG UNIDIMENSIONAL
TEST  WITH  OPTIMAL  AND  PRACTICAL  APPROPRIATENESS  INDICES 


Introduction 

It  is  relatively  easy  to  propose  new  appropriateness  indices. 
Unfortunately,  evaluations  of  the  relative  merits  of  the  various  indices  have 
been  very  difficult  in  previous  research.  Cliff's  (1979,  p.  388)  description 
of  a  related  problem  cogently  summarized  the  difficulty  in  evaluating  indices: 
"Now  the  trouble  is  that  the  formulas  multiply  not  just  like  rabbits,  or  even 
guppies,  but  rather  like  amoebae:  by  both  fusion  and  conjugation,  and  there 
seemed  to  be  no  general  principle  to  use  in  selecting  from  among  them." 
Harnisch  and  Tatsuoka  (1983),  for  example,  correlated  14  different  indices  in 
order  to  see  which  pairs  were  more  and  less  related,  but  this  approach  has 
limited  value  in  determining  which  index  is  best.  Furthermore,  this  approach 
does  not  determine  which  indices,  if  any,  are  good  enough  for  operational  use. 

In  the  past,  two  criteria  have  been  used  to  evaluate  appropriateness 
indices: standardization and relative power. Standardization, introduced by
Drasgow,  Levine,  and  Williams  (1985),  refers  to  the  extent  to  which  the 
conditional  distributions  (given  particular  values  of  the  latent  trait)  of  an 
index  are  invariant  across  levels  of  the  latent  trait.  There  is  little 
confounding between ability and measured appropriateness for a well-
standardized index. Well-standardized indices have two attractive features.
First,  high  rates  of  detection  of  aberrant  response  patterns  by  well- 
standardized  indices  cannot  be  due  merely  to  differences  in  ability 
distributions  or  number-right  distributions  across  normal  and  aberrant 
samples.  In  contrast,  high  detection  rates  obtained  by  poorly  standardized 
indices  may  be  due  largely  to  differences  in  ability  distributions.  This 
point  is  illustrated  in  a  later  section  of  this  chapter.  Second,  a  well- 
standardized  index  is  easy  to  use  in  practice  because  index  scores  for 
individuals  with  different  standings  on  the  latent  trait  can  be  compared 
directly.  In  contrast,  scores  on  poorly  standardized  indices  can  be 
interpreted  only  in  relation  to  their  conditional  distributions;  consequently, 
a  single  cutting  score  for  classification  into  aberrant  and  appropriate  groups 
is  not  possible.  Furthermore,  it  is  sometimes  very  difficult  and  time- 
consuming  to  obtain  the  conditional  distributions  of  an  appropriateness  index. 
In  such  cases,  the  practical  usefulness  of  the  index  is  limited. 

Relative power, the second criterion used to evaluate appropriateness
indices,  refers  to  the  ability  of  a  particular  index  to  correctly  classify 
aberrant  response  patterns  as  aberrant,  compared  to  the  classification  rate  of 
another  index.  If  some  well-standardized  index  has  acceptable  power,  then  it 
can  be  used  in  operational  settings.  Unfortunately,  no  unequivocal 
conclusions  about  the  detectability  of  some  form  of  aberrance  are  possible  if 
none  of  the  indices  under  consideration  has  adequate  power.  We  do  not  know 
whether  or  not  there  exists  some  other  index,  not  included  in  the 
experimental  study,  that  has  acceptable  power.  In  addition,  even  if  an  index 
were  found  to  have  adequate  power  for  operational  use,  we  do  not  know  whether 
or  not  there  is  an  index,  as  yet  undiscovered,  that  is  substantially  superior 
to  all  known  indices. 

It  is  now  possible  to  determine  the  detectability  of  a  specified  form  of 
aberrance  by  the  methods  devised  by  Levine  and  Drasgow  (1984;  1987).  They 
introduced  a  general  method  for  ascertaining  the  maximum  power  that  can  be 
achieved  by  any  index.  Chapters  2,  3,  and  4  contain  the  first  major 
applications  of  the  method. 

By  means  of  a  new  numerical  algorithm,  Levine  and  Drasgow  (1984;  1987) 
were  able  to  apply  the  Neyman-Pearson  Lemma  to  specify  an  appropriateness 
index  that  is  optimal  in  the  sense  that  no  other  index  computed  from  the  item 
responses  can  achieve  a  higher  detection  rate  (at  each  error  rate)  of  the 
given  form  of  aberrance. 

As  a  result  of  their  research,  it  is  now  possible  to  determine  the 
absolute  effectiveness  of  an  index  for  detecting  a  particular  type  of 
aberrance  on  a  given  test.  The  absolute  effectiveness  of  an  index  is 
determined  by  comparing  its  detection  rate  with  the  detection  rate  of  the 
corresponding optimal index. In the first study conducted for this effort,
11 different appropriateness indices were evaluated for their abilities to
detect spuriously high and low response patterns on a long unidimensional
power test: namely, the SAT-V.

The  appropriateness  indices  examined  in  the  first  study  and  some 
computational  notes  are  presented  in  the  next  section.  The  extent  to  which 
each  index  is  standardized  and  the  power  of  each  index  for  detecting  several 
forms  of  aberrance  are  then  examined.  Some  remarks  concerning  the  results 
are  provided  in  the  final  section  of  this  chapter. 


Appropriateness Indices


Optimal  Indices 


Suppose we wish to test a simple null hypothesis against a simple
alternative hypothesis. If the probability of a Type I error is $\alpha$, then
the most powerful test is the test that minimizes the probability of a Type II
error among the set of tests with the given Type I error rate. The
Neyman-Pearson Lemma states that maximum power is achieved by a likelihood
ratio test. More specifically, let $L_n(x)$ and $L_a(x)$ denote the likelihoods
of the data $x$ under the null and alternative hypotheses, respectively. Then
the Neyman-Pearson Lemma states that of all tests with a Type I error rate of
$\alpha$, none is more powerful than a test obtained from the likelihood ratio

$$\frac{L_a(x)}{L_n(x)}.$$

The Neyman-Pearson Lemma can be applied in the context of Appropriateness
Measurement to construct most powerful tests and, consequently, optimal
appropriateness indices. To see how it is used, suppose that local
independence holds, $u = (u_1, \ldots, u_n)$, and $P_i(u_i \mid \theta)$ is the
probability of response $u_i$ to item $i$ by an examinee of ability $\theta$
under the null hypothesis that the response pattern is appropriate (normal).
Then the likelihood of a response vector $u$ by an examinee of ability
$\theta$ is

$$P_{\text{Normal}}(u \mid \theta) = \prod_{i=1}^{n} P_i(u_i \mid \theta).$$

If the ability density is $f(\theta)$, then using elementary probability

$$P_{\text{Normal}}(u) = \int P_{\text{Normal}}(u \mid \theta)\, f(\theta)\, d\theta.$$

To apply the Neyman-Pearson Lemma, it is necessary to compute
$P_{\text{Aberrant}}(u)$. This quantity can be obtained by carrying the
conditioning-integrating argument one step further. For concreteness, suppose
that the type of aberrance under consideration consists of $m$ randomly
selected items being modified by the spuriously low treatment. Let $S_k$
denote a set indicating the $k$th way of selecting $m$ of $n$ items (of the
$\binom{n}{m}$ ways possible), let $P_{\text{Aberrant}}(u \mid \theta, S_k)$
denote the likelihood of response pattern $u$ for an examinee with ability
$\theta$ when the items in $S_k$ are subjected to the spuriously low
treatment, and let $P(S_k)$ denote the probability of $S_k$ (i.e.,
$P(S_k) = 1/\binom{n}{m}$). Then

$$P_{\text{Aberrant}}(u \mid \theta) = \sum_{k} P_{\text{Aberrant}}(u \mid \theta, S_k)\, P(S_k)$$

so that

$$P_{\text{Aberrant}}(u) = \int \sum_{k} P_{\text{Aberrant}}(u \mid \theta, S_k)\, P(S_k)\, f(\theta)\, d\theta. \tag{1}$$

By taking advantage of the symmetry in the
$P_{\text{Aberrant}}(u \mid \theta, S_k)$, Levine and Drasgow (1984) obtained an
efficient numerical algorithm for computing $P_{\text{Aberrant}}(u \mid \theta)$.
Using a numerical quadrature formula, the right-hand side of Equation 1 can be
accurately evaluated with an acceptable amount of computation. Details about
these calculations and a theoretical treatment of the general problem are
provided by Levine and Drasgow (1984; 1987).


Thus, it is possible to compute the likelihood ratio

$$P_{\text{Aberrant}}(u) \,/\, P_{\text{Normal}}(u),$$


and  test  the  simple  null  hypothesis  that  a  response  pattern  is  normal  against 
the  simple  alternative  hypothesis  that  the  response  pattern  is  aberrant.  Due 
to  the  Neyman-Pearson  Lemma,  the  likelihood  ratio  statistic  provides  a  most 
powerful  test;  consequently,  when  it  is  used  as  an  appropriateness  index,  the 
likelihood  ratio  statistic  is  as  powerful  as  any  index  that  can  be  computed 
from  the  item  responses. 
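
The construction above can be illustrated with a small brute-force sketch. It is not the efficient algorithm of Levine and Drasgow (1984); it simply enumerates every subset $S_k$, which is feasible only for a handful of items. Several details are assumptions made for illustration: the three-parameter logistic model for normal responding, a standard normal ability density, a midpoint quadrature rule on [-4, 4], and a stand-in "spuriously low treatment" in which each selected item is answered correctly with probability 0.2 regardless of ability. All function names are hypothetical.

```python
import math
from itertools import combinations

D = 1.702  # scaling constant used with the three-parameter logistic model


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def lik_normal(u, theta, items):
    """P_Normal(u | theta): product over items under local independence."""
    lik = 1.0
    for ui, (a, b, c) in zip(u, items):
        p = p_correct(theta, a, b, c)
        lik *= p if ui == 1 else 1.0 - p
    return lik


def lik_aberrant(u, theta, items, m, guess=0.2):
    """P_Aberrant(u | theta): average over all m-item subsets S_k.

    Assumption: items in S_k are answered correctly with probability
    `guess` whatever the ability (a stand-in for the spuriously low
    treatment, which this section does not define).
    """
    subsets = list(combinations(range(len(u)), m))
    total = 0.0
    for subset in subsets:
        chosen = set(subset)
        lik = 1.0
        for i, (ui, (a, b, c)) in enumerate(zip(u, items)):
            p = guess if i in chosen else p_correct(theta, a, b, c)
            lik *= p if ui == 1 else 1.0 - p
        total += lik
    return total / len(subsets)  # each S_k has probability 1 / C(n, m)


def marginal(cond_lik, lo=-4.0, hi=4.0, points=81):
    """Integrate cond_lik(theta) * f(theta) by a midpoint rule, f = N(0, 1)."""
    step = (hi - lo) / points
    total = 0.0
    for q in range(points):
        theta = lo + (q + 0.5) * step
        density = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        total += cond_lik(theta) * density * step
    return total


def likelihood_ratio(u, items, m):
    """Likelihood ratio index: P_Aberrant(u) / P_Normal(u)."""
    numerator = marginal(lambda t: lik_aberrant(u, t, items, m))
    denominator = marginal(lambda t: lik_normal(u, t, items))
    return numerator / denominator
```

With hypothetical item parameters such as `items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]`, a call like `likelihood_ratio([0, 1, 1], items, m=1)` returns the index for one response pattern; larger values favor the aberrant hypothesis. When `m = 0` the two likelihoods coincide and the ratio is 1.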


Standardized $\ell_0$

Let z3 denote the standardized $\ell_0$ index (Drasgow et al., 1985). It may
be computed by the formula

$$z_3 = \frac{\ell_0 - M(\hat\theta)}{[S(\hat\theta)]^{1/2}}. \tag{3}$$

In this formula, $\ell_0$ is the logarithm of the three-parameter logistic
likelihood function evaluated at the maximum likelihood estimate $\hat\theta$
of $\theta$:

$$\ell_0 = \sum_{i=1}^{n} [u_i \log P_i(\hat\theta) + (1 - u_i) \log Q_i(\hat\theta)],$$

where $u_i$ is the dichotomously scored (1 = correct, 0 = incorrect) item
response for item $i$, $i = 1, 2, \ldots, n$;
$Q_i(\hat\theta) = 1 - P_i(\hat\theta)$;

$$P_i(\hat\theta) = \hat{c}_i + \frac{1 - \hat{c}_i}{1 + \exp[-D \hat{a}_i (\hat\theta - \hat{b}_i)]}; \tag{4}$$

$D = 1.702$; and $\hat{a}_i$, $\hat{b}_i$, and $\hat{c}_i$ are item parameter
estimates.

The conditional expectation of $\ell_0$, given $\theta = \hat\theta$, is

$$M(\hat\theta) = \sum_{i=1}^{n} [P_i(\hat\theta) \log P_i(\hat\theta) + Q_i(\hat\theta) \log Q_i(\hat\theta)] \tag{5}$$

and its conditional variance is

$$S(\hat\theta) = \sum_{i=1}^{n} P_i(\hat\theta)\, Q_i(\hat\theta) \{\log[P_i(\hat\theta) / Q_i(\hat\theta)]\}^2. \tag{6}$$


Justifications  of  these  formulas  can  be  found  in  Drasgow  et  al.  (1985). 
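
As a minimal sketch of Equations 3 through 6, the function below computes z3 for one response vector. It assumes the ability estimate has already been obtained and is passed in as `theta_hat`, and that each item is given as a hypothetical `(a, b, c)` tuple of parameter estimates; the function names are illustrative only.

```python
import math

D = 1.702  # scaling constant from Equation 4


def p3pl(theta, a, b, c):
    """Equation 4: three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def z3(u, theta_hat, items):
    """Standardized l0 index (Equation 3) at a supplied ability estimate."""
    l0 = 0.0    # observed log likelihood
    mean = 0.0  # M(theta_hat), Equation 5
    var = 0.0   # S(theta_hat), Equation 6
    for ui, (a, b, c) in zip(u, items):
        p = p3pl(theta_hat, a, b, c)
        q = 1.0 - p
        l0 += ui * math.log(p) + (1 - ui) * math.log(q)
        mean += p * math.log(p) + q * math.log(q)
        var += p * q * math.log(p / q) ** 2
    return (l0 - mean) / math.sqrt(var)
```

Because the mean and variance are taken conditional on the supplied ability value, z3 has mean 0 and variance 1 over all response patterns generated at that value. With an estimated ability the standardization is only approximate.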


Fit  Statistics 


Two fit statistics for the three-parameter logistic model were suggested
by  Rudner  (1983)  as  generalizations  of  the  Rasch  model  fit  statistics  used  by 
Wright  (1977)  and  his  colleagues.  The  first  is  the  mean  squared  standardized 
residual 

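$$F1 = \frac{1}{n} \sum_{i=1}^{n} \frac{[u_i - P_i(\hat\theta)]^2}{P_i(\hat\theta)\, Q_i(\hat\theta)}. \tag{7}$$

The other fit statistic is

$$F2 = \frac{\sum_{i=1}^{n} [u_i - P_i(\hat\theta)]^2}{\sum_{i=1}^{n} P_i(\hat\theta)\, Q_i(\hat\theta)}, \tag{8}$$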

which  Rudner  found  to  be  quite  effective  in  some  cases  (see  Rudner,  1983,  p. 
214  and  p.  216,  where  Rudner  uses  W3  to  denote  an  expression  proportional  to 
Equation  8). 
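
Equations 7 and 8 translate directly into a few lines of code. The sketch below assumes the model probabilities $P_i(\hat\theta)$ have already been computed and are passed in as `probs`; the function name is hypothetical.

```python
def fit_statistics(u, probs):
    """Return (F1, F2) from Equations 7 and 8.

    u     : 0/1 item responses
    probs : model probabilities of a correct response at the ability estimate
    """
    squared = [(ui - p) ** 2 for ui, p in zip(u, probs)]
    pq = [p * (1.0 - p) for p in probs]
    # F1: mean of the squared residuals, each divided by its own variance
    f1 = sum(s / v for s, v in zip(squared, pq)) / len(u)
    # F2: total squared residual divided by total variance
    f2 = sum(squared) / sum(pq)
    return f1, f2
```

Both statistics have numerators whose expected value matches the corresponding variance term, so values near 1 indicate responses about as variable as the model predicts.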

Likelihood  Function  Curvature  Statistics 


Four  indices  that  provide  measures  of  the  "flatness"  of  the  likelihood 
function  were  evaluated.  These  indices  are  motivated  by  the  notion  that 
inappropriate  responses  will  flatten  the  likelihood  function  near  its  maximum 
because no single value of $\theta$ will allow the item response model to provide a
good  fit  to  the  response  vector.  Therefore,  the  likelihood  function  will  not 
have  a  sharp  maximum;  instead,  it  will  be  relatively  flat. 

Normalized Jackknife. The first measure of the curvature of the
likelihood function is the normalized jackknife variance estimate. In order
to compute this index, let $\hat\theta$ denote the three-parameter logistic
maximum likelihood estimate of ability based on all $n$ test items and let
$\hat\theta_{(j)}$ denote the estimate based on the $n - 1$ items remaining
when item $j$ is excluded. The pseudo-values (see, for example, Mosteller &
Tukey, 1968) are

$$\theta^*_j = n\hat\theta - (n - 1)\hat\theta_{(j)}, \qquad j = 1, 2, \ldots, n.$$

The jackknife estimate of $\theta$ is then

$$\theta^* = \frac{1}{n} \sum_{j=1}^{n} \theta^*_j$$

and the jackknife estimate of its variance is

$$\mathrm{Var}(\theta^*) = \frac{1}{n(n - 1)} \sum_{j=1}^{n} (\theta^*_j - \theta^*)^2.$$

The jackknife variance estimate is not a standardized appropriateness
index; there is more Fisher information about $\theta$ in some ability ranges
than in others, and so $\mathrm{Var}(\theta^*)$ is expected to depend upon
$\theta$. Lord's (1980) formula for the information of the three-parameter
logistic maximum likelihood estimate of $\theta$,

$$I(\theta) = \sum_{i=1}^{n} \frac{[P_i'(\theta)]^2}{P_i(\theta)\, Q_i(\theta)}, \tag{9}$$

can be used to reduce this problem. Since the reciprocal of $I(\theta)$ is the
asymptotic variance of $\hat\theta$, the jackknife estimate of variance can be
approximately normalized by evaluating the information function at
$\hat\theta$ and computing

$$JK = \mathrm{Var}(\theta^*)\, I(\hat\theta). \tag{10}$$

It is possible to arrange the calculations for computing JK very
efficiently. We found that one Newton-Raphson iteration was adequate to move
from $\hat\theta$ to $\hat\theta_{(j)}$. Then, since the first and second
derivatives of the log likelihood functions for the whole test are sums over
$n$ items, the first and second derivatives of the log likelihood functions
for the $n - 1$ item test can be obtained by single subtractions of already
computed quantities. Consequently, all the pseudo-values, $\theta^*$, and JK
can be obtained with fewer arithmetic calculations than are required in a
single Newton-Raphson iteration in the calculation of $\hat\theta$.
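
The following sketch computes JK from the definitions above. It does not use the one-step Newton-Raphson shortcut described in the text; as a simplifying assumption, each leave-one-out estimate is obtained by a full (and crude) grid search with local refinement, which is slower but easy to verify. Items are hypothetical `(a, b, c)` tuples, `u` is a Python list of 0/1 responses, and all function names are illustrative.

```python
import math

D = 1.702


def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_lik(theta, u, items):
    total = 0.0
    for ui, (a, b, c) in zip(u, items):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if ui == 1 else math.log(1.0 - p)
    return total


def mle_theta(u, items, lo=-4.0, hi=4.0, step=0.05):
    """Crude maximum likelihood estimate: grid search, then ternary refinement."""
    grid = [lo + k * step for k in range(int(round((hi - lo) / step)) + 1)]
    best = max(grid, key=lambda t: log_lik(t, u, items))
    left, right = max(lo, best - step), min(hi, best + step)
    for _ in range(60):
        m1 = left + (right - left) / 3.0
        m2 = right - (right - left) / 3.0
        if log_lik(m1, u, items) < log_lik(m2, u, items):
            left = m1
        else:
            right = m2
    refined = 0.5 * (left + right)
    # Keep the grid point if refinement did not improve on it.
    return refined if log_lik(refined, u, items) >= log_lik(best, u, items) else best


def information(theta, items):
    """Equation 9, using P'(theta) = D a (P - c)(1 - P) / (1 - c) for this model."""
    total = 0.0
    for a, b, c in items:
        p = p3pl(theta, a, b, c)
        slope = D * a * (p - c) * (1.0 - p) / (1.0 - c)
        total += slope ** 2 / (p * (1.0 - p))
    return total


def normalized_jackknife(u, items):
    """Equation 10: jackknife variance of theta times information at theta-hat."""
    n = len(u)
    theta_hat = mle_theta(u, items)
    pseudo = []
    for j in range(n):
        theta_j = mle_theta(u[:j] + u[j + 1:], items[:j] + items[j + 1:])
        pseudo.append(n * theta_hat - (n - 1) * theta_j)
    theta_star = sum(pseudo) / n
    var_star = sum((t - theta_star) ** 2 for t in pseudo) / (n * (n - 1))
    return var_star * information(theta_hat, items)
```

The ability range [-4, 4] and the grid step are arbitrary choices for this sketch; response patterns whose likelihood has no interior maximum are simply assigned an estimate at the boundary.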


Convergence of $\hat\theta$. A possible consequence of a relatively flat
likelihood function for aberrant response patterns is that the number of
iterations required to obtain $\hat\theta$ may be increased. The number NI of
Newton-Raphson iterations required to obtain $\hat\theta$ can therefore be
used as an appropriateness index.

Expected versus Observed Likelihood Function Curvatures. This index
(O/E) is also motivated by a hypothesis about the likelihood function's
flatness. If the likelihood function is flatter for aberrant response
patterns than for normal response patterns, then we would expect that the
observed information, defined as minus the second derivative of the log
likelihood function at $\hat\theta$ given the response vector $u$ (see Efron &
Hinkley, 1978, p. 457), would be less than the information $I(\hat\theta)$
given in Equation 9, which (given $\hat\theta$) does not depend upon $u$. Thus,
the sixth index is the ratio of the observed and expected information

$$O/E = \frac{-\,\partial^2 \ell / \partial \theta^2 \,\big|_{\theta = \hat\theta}}{I(\hat\theta)}, \tag{11}$$

where $\ell$ is the log likelihood

$$\ell = \sum_{i=1}^{n} [u_i \log P_i(\theta) + (1 - u_i) \log Q_i(\theta)]. \tag{12}$$
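
A short numerical sketch of the O/E ratio follows. As an assumption made for brevity, the second derivative of the log likelihood is approximated by a central finite difference rather than derived analytically, and the ability estimate is supplied by the caller. Items are hypothetical `(a, b, c)` tuples and the function names are illustrative.

```python
import math

D = 1.702


def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_lik(theta, u, items):
    """Log likelihood of the response vector at a given ability."""
    total = 0.0
    for ui, (a, b, c) in zip(u, items):
        p = p3pl(theta, a, b, c)
        total += ui * math.log(p) + (1 - ui) * math.log(1.0 - p)
    return total


def expected_information(theta, items):
    """Information function of Equation 9."""
    total = 0.0
    for a, b, c in items:
        p = p3pl(theta, a, b, c)
        slope = D * a * (p - c) * (1.0 - p) / (1.0 - c)
        total += slope ** 2 / (p * (1.0 - p))
    return total


def observed_information(theta, u, items, h=1e-3):
    """Minus the second derivative of the log likelihood (central difference)."""
    centre = log_lik(theta, u, items)
    plus = log_lik(theta + h, u, items)
    minus = log_lik(theta - h, u, items)
    return -(plus - 2.0 * centre + minus) / (h * h)


def oe_index(theta_hat, u, items):
    """Ratio of observed to expected information at the ability estimate."""
    return observed_information(theta_hat, u, items) / expected_information(theta_hat, items)
```

One useful check on the sketch: when every guessing parameter is zero, the second derivative of the log likelihood does not depend on the responses, so the ratio equals 1 (up to numerical error) for any response pattern.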

Bayes Posterior Variance. Another statistic closely related to the O/E
index is the posterior variance B of the Bayes estimate of $\theta$. It is expected
to  be  relatively  large  for  aberrant  response  vectors  and  relatively  small  for 
normal  response  vectors.  Thus,  it  should  serve  to  distinguish  between  normal 
and  aberrant  response  patterns. 

Item-Option  Variance 

Suppose that we consider the subset of $N_{ik}$ examinees in the test norming
sample who selected option $k$ to item $i$. It is easy to compute the mean
number-right score $\bar{X}_{ik}$ for these examinees. In this way, we can
identify options to item $i$ that are typically selected by high ability
examinees (e.g., the correct option) and options that are typically selected
by lower ability examinees. For spuriously high and low response patterns, we
would expect to observe inconsistency in $\bar{X}_{ik}$: sometimes options
with low $\bar{X}_{ik}$ are selected and sometimes options with high
$\bar{X}_{ik}$ are selected. For this reason, we evaluated the item-option
variance

$$\mathrm{IOV} = \mathrm{Var}(\bar{X}_{ik})$$

as  a  measure  of  appropriateness. 

Caution  Indices 


Sato's Caution Index. Three "caution indices" were also examined.

The  first  is  Sato's  (1975)  caution  index  S  (see  also  Tatsuoka  &  Linn,  1983, 
but  replace  y#j  with  for  a  simpler  version  of  their  Equation  1).  S  is 

easy  to  compute  and  is  widely  used  in  Japan.  To  compute  S,  suppose  that  the  n 
test items are ordered from easiest to hardest on the basis of proportion
right $p_i$ in the test norming sample. Let

$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i$$

be the mean proportion correct and suppose that an examinee answers $k$ items
correctly. If $p$ is a vector containing the $p_i$ and $g$ is a perfect
Guttman response pattern with 1s as its first $k$ elements and 0s for the
next $n - k$ elements, then

$$S = 1 - \frac{\mathrm{Cov}(u, p)}{\mathrm{Cov}(g, p)} = 1 - \frac{\sum_{i=1}^{n} u_i (p_i - \bar{p})}{\sum_{i=1}^{k} (p_i - \bar{p})}. \tag{14}$$

Note that the summation in the denominator of the last expression is from 1 to
$k$ (i.e., over the $k$ easiest items, which have the largest $p_i$ values),
not 1 to $n$.
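
A minimal sketch of Sato's index follows. It sorts the items internally so that the caller need not pre-order them, and it treats the denominator as a sum over the k easiest items, consistent with a Guttman pattern that answers the easiest items correctly. The function name and the error handling for zero and perfect scores are assumptions of this sketch.

```python
def sato_caution_index(u, p):
    """Sato's caution index S for one examinee.

    u : 0/1 item responses
    p : proportion right for each item in the norming sample (same order as u)
    """
    n = len(u)
    k = sum(u)  # number-right score
    if k == 0 or k == n:
        raise ValueError("S is undefined for zero and perfect scores")
    p_bar = sum(p) / n
    easiest_first = sorted(range(n), key=lambda i: p[i], reverse=True)
    # Numerator: covariance term for the observed responses.
    numerator = sum(u[i] * (p[i] - p_bar) for i in range(n))
    # Denominator: the same term for a perfect Guttman pattern with score k.
    denominator = sum(p[i] - p_bar for i in easiest_first[:k])
    return 1.0 - numerator / denominator
```

Under this definition a perfect Guttman pattern yields S = 0, and S grows as correct answers shift from easy items to hard ones.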


Tatsuoka's  Standardized  Extended  Caution  Indices.  Two  indices  that  are 
related  to  Sato's  caution  index  are  the  second  and  fourth  standardized 
extended  caution  indices  T2  and  T4  presented  by  Tatsuoka  (1984,  p.  104). 
These  two  indices  (of  the  four  studied  by  Tatsuoka)  were  included  because 
Harnisch and Tatsuoka (1983) found that these indices were not related
(linearly or curvilinearly) to true score and, therefore, $\theta$.


T2 and T4 can be computed relatively easily. Let $\hat\theta_j$ denote the
three-parameter logistic maximum likelihood estimate of ability for the $j$th
person in the test norming sample of $N$ examinees, and let
$P_i(\hat\theta_j)$ be the probability of a correct response to item $i$ by
this person computed from Equation 4. Then define

$$G_i = \frac{1}{N} \sum_{j=1}^{N} P_i(\hat\theta_j)$$

and

$$\bar{G} = \frac{1}{n} \sum_{i=1}^{n} G_i.$$

To compute T2 and T4 for an examinee in the normal sample or an aberrant
sample, let

$$\bar{P} = \frac{1}{n} \sum_{i=1}^{n} P_i(\hat\theta).$$

Then

$$T2 = \frac{\sum_{i=1}^{n} [P_i(\hat\theta) - u_i](G_i - \bar{G})}{\left[\sum_{i=1}^{n} P_i(\hat\theta)\, Q_i(\hat\theta)\, (G_i - \bar{G})^2\right]^{1/2}} \tag{15}$$

and

$$T4 = \frac{\sum_{i=1}^{n} [P_i(\hat\theta) - u_i][P_i(\hat\theta) - \bar{P}]}{\left[\sum_{i=1}^{n} P_i(\hat\theta)\, Q_i(\hat\theta)\, [P_i(\hat\theta) - \bar{P}]^2\right]^{1/2}}. \tag{16}$$

It  should  be  noted  that  Equations  14,  15,  and  16  are  generalizations  of 
the  original  caution  indices  to  the  situation  where  item  parameters  are 
estimated  in  a  test  norming  sample. 
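
The two extended caution indices can be sketched as a single function. It assumes the caller has already computed the examinee's model probabilities $P_i(\hat\theta)$ (passed as `probs`) and the norming-sample means $G_i$ (passed as `g`); the function and argument names are hypothetical.

```python
import math


def extended_caution_indices(u, probs, g):
    """Return (T2, T4) for one examinee.

    u     : 0/1 item responses
    probs : P_i(theta-hat) for this examinee
    g     : G_i, the norming-sample mean of P_i for each item
    """
    n = len(u)
    g_bar = sum(g) / n
    p_bar = sum(probs) / n
    t2_num = t2_den = t4_num = t4_den = 0.0
    for ui, p, gi in zip(u, probs, g):
        residual = p - ui
        pq = p * (1.0 - p)
        t2_num += residual * (gi - g_bar)
        t2_den += pq * (gi - g_bar) ** 2
        t4_num += residual * (p - p_bar)
        t4_den += pq * (p - p_bar) ** 2
    return t2_num / math.sqrt(t2_den), t4_num / math.sqrt(t4_den)
```

Each denominator is the standard deviation of its numerator when responses follow the model at the given probabilities, which is what makes the indices standardized.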

Standardization


Problem 


Measured appropriateness can be confounded with ability. Drasgow et al.
(1985,  p.  74),  for  example,  provide  an  example  of  a  strong,  nearly  linear 
relation  between  estimated  ability  and  an  unstandardized  index.  A  score  of, 
say,  -50  on  this  index  at  one  ability  level  indicates  a  good  fit  of  the  model 
to  a  response  vector,  but  the  same  index  score  at  other  ability  levels 
indicates  a  very  poor  fit.  Consequently,  an  observed  difference  between  the 
distributions  of  index  scores  for  normal  and  aberrant  response  vectors  is  not 
unequivocal  evidence  of  index  effectiveness.  Instead,  it  may  simply  reflect 
differences  in  ability  or  number-right  distributions.  This  problem  does  not 
occur  if  an  appropriateness  index  is  well  standardized;  that  is,  if  the 
conditional distributions (given $\theta$) of the index are (approximately) equal
across possible values of $\theta$ for normal examinees.

In  practical  applications  of  Appropriateness  Measurement,  it  would  be 
convenient  if  a  single  cutting  score  could  be  used  to  classify  response 
patterns  as  aberrant  or  normal.  If  the  conditional  distributions  of  an  index 
are  not  identical,  then  the  interpretation  of  a  score  on  a  practical 
appropriateness  index  must  be  made  vis-a-vis  the  associated  conditional 
distribution.  Consequently,  it  would  not  be  possible  to  use  a  single  cutting 
score  to  classify  response  patterns  as  aberrant  nor  would  it  be  possible  to 
compare  directly  index  scores  of  examinees  with  differing  abilities. 

We  would  expect  little  degradation  of  the  performance  of  a  well- 
standardized  index  if  the  ability  distribution  were  to  change  abruptly.  Such 
a  change  would  be  expected,  for  example,  with  the  ASVAB  examinee  population  in 
a  period  of  national  mobilization. 

ROC  Curves 


If  an  index  is  properly  standardized,  its  distribution  will  be  nearly  the 
same  in  subpopulations  of  normal  examinees  who  differ  in  ability.  Hence,  the 
index  could  not  be  used  to  distinguish  among  groups.  A  standard,  very  general 
method  for  studying  the  extent  to  which  some  statistic  can  differentiate 
between two groups is the Receiver Operating Characteristic (ROC) curve.
Thus,  we  can  study  index  standardization  by  using  an  ROC  curve  to  determine 
whether  the  index  distinguishes  between  groups  of  normal  examinees  who  differ 
in  ability. 

An ROC curve is obtained by specifying a cutting score t for an index and
then computing

x(t) = proportion of group 1 (say, normal, low ability examinees) response
patterns with index values less than t (assuming that small index values
indicate aberrance);

y(t) = proportion of group 2 (say, normal, high ability examinees) response
patterns with index scores less than t.

An ROC curve consists of the points (x(t), y(t)) obtained for various values
of t. The proportion x(t) is called the false alarm rate, and y(t) is called
the hit rate. A detailed example of the construction of an ROC curve is given
by Hulin, Drasgow, and Parsons (1983, pp. 131-134).

An appropriateness index is well standardized across two ability levels
if the ROC curve lies along the diagonal line y = x.
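The construction of x(t) and y(t) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small index-score lists are hypothetical and are not taken from the study. The second function summarizes how far the curve departs from the diagonal y = x, which is one simple way to read the standardization criterion.

```python
def roc_points(group1, group2, cutoffs):
    """Return (x(t), y(t)) for each cutting score t.

    x(t) is the proportion of group 1 index scores below t (false alarm rate);
    y(t) is the proportion of group 2 index scores below t (hit rate).
    """
    points = []
    for t in cutoffs:
        x = sum(score < t for score in group1) / len(group1)
        y = sum(score < t for score in group2) / len(group2)
        points.append((x, y))
    return points


def max_departure_from_diagonal(points):
    """Largest |y - x| over the curve; 0 means the curve lies on y = x."""
    return max(abs(y - x) for x, y in points)


# Hypothetical index scores for two groups of normal examinees.
low_ability = [-1.2, -0.4, 0.1, 0.8, 1.5]
high_ability = [-1.0, -0.5, 0.2, 0.7, 1.4]

pts = roc_points(low_ability, high_ability, cutoffs=[-0.45, 0.5, 1.0])
gap = max_departure_from_diagonal(pts)
```

A well-standardized index would give a small `gap` for every pair of ability groups; a large value signals that the index partly reflects ability rather than aberrance.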

Method 

Polychotomous  item  responses  (five-option  multiple-choice  items  with 
omitting  allowed)  were  simulated  using  the  histograms  constructed  by  Levine 
and  Drasgow  (1983).  They  used  the  three-parameter  logistic  model  to  estimate 
the  abilities  of  49,470  examinees  from  the  85-item  April  1975  administration 
of  the  SAT-V.  Then  the  examinees  were  sorted  into  25  groups  on  the  basis  of 
estimated ability. The 4th, 8th, ..., 96th percentiles of the normal (0,1)
distribution  were  used  as  cutting  scores  when  sorting  examinees.  Then  the 
proportions  of  examinees  choosing  each  option  (treating  skipped  and  not- 
reached  items  as  a  single  response  category)  were  computed  for  each  of  the  25 
ability  groups.  Probabilities  of  option  choices  were  then  computed  by  linear 
interpolation  between  category  medians  (i.e.,  the  2nd,  6th,  ...,  98th 
percentiles  from  the  normal  (0,1)  distribution). 

Five samples of normal response patterns were generated by first sampling
3,000 numbers (θs) from the normal (0,1) distribution truncated to the
(-2.05, 2.05] interval. (It was necessary to truncate the ability
distribution because interpolation below the 2nd percentile or above the 98th
percentile was not possible with the histograms.) Then low [-2.05 to -1.50),
moderately low (-.70 to -.55), average (-.05 to .05), moderately high (.55 to
.70), and high (1.49 to 2.05] θ samples of N = 200 each were formed.

Polychotomous item response vectors were then generated for each θ value.
For each item, the associated histogram was used to compute the conditional
(given θ) probabilities of the six possible responses (treating skipped and
not-reached as the sixth response). A number was sampled from the uniform
distribution on the unit interval, and a simulated response was obtained by
determining where the random number was located in the cumulative distribution
corresponding to the conditional probabilities.
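The two steps described above (linear interpolation of option probabilities between ability-group medians, then locating a uniform random number in the cumulative distribution) can be sketched as follows. The function names, the three-point grid of medians, and the two-category probabilities are hypothetical; the study used 25 ability groups and six response categories.

```python
import bisect
import random
from itertools import accumulate


def interpolate_probs(theta, medians, probs_by_group):
    """Linearly interpolate response probabilities between group medians."""
    if theta < medians[0] or theta > medians[-1]:
        raise ValueError("theta lies outside the range of the group medians")
    # Index of the upper bracketing median (clamped so theta == last median works).
    j = min(bisect.bisect_right(medians, theta), len(medians) - 1)
    lo, hi = medians[j - 1], medians[j]
    w = (theta - lo) / (hi - lo)
    return [(1 - w) * p0 + w * p1
            for p0, p1 in zip(probs_by_group[j - 1], probs_by_group[j])]


def sample_response(option_probs, rng):
    """Return the category whose cumulative interval contains a uniform draw."""
    u = rng.random()
    for category, upper in enumerate(accumulate(option_probs)):
        if u < upper:
            return category
    return len(option_probs) - 1  # guard against floating-point rounding


# Hypothetical two-category example on a three-point grid of medians.
medians = [-1.0, 0.0, 1.0]
probs_by_group = [[0.5, 0.5], [0.25, 0.75], [0.0, 1.0]]
probs = interpolate_probs(-0.5, medians, probs_by_group)
response = sample_response(probs, random.Random(0))
```

The out-of-range check mirrors the reason given in the text for truncating the ability distribution: interpolation is undefined beyond the outermost medians.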


Finally, each of the 11 practical appropriateness indices was computed
for each response vector in each sample. Then ROC curves were computed for
each of the $\binom{5}{2} = 10$ possible pairs of samples and each of the 11
appropriateness indices.

Results 


Figures  1  through  3  present  the  results  for  the  low-average,  average- 
high,  and  low-high  comparisons.  The  results  for  the  other  seven  comparisons 
were  consistent  with  the  trends  seen  in  these  three  figures;  consequently, 
they  will  not  be  presented.  Furthermore,  only  the  lower  left  quarter  of  the 
ROC  curve  is  plotted  because  it  is  unlikely  that  anyone  would  set  a  cutting 
score that yielded a false alarm rate of more than 50%.

In Figure 1, it is evident that NI, IOV, S, and B are poorly
standardized. This result is not surprising because no explicit steps were
taken to standardize these indices. The standardizations of the z3, F1, F2,
JK, and O/E indices seem reasonably good across low θ and average θ groups.
The standardizations of T2 and T4 seem somewhat less adequate, although T2 is
well standardized for false alarm rates of less than .20.

The pattern of results in Figure 2 is somewhat different from the results
in Figure 1. In both figures, NI, IOV, and B are poorly standardized, and z3,
F2, JK, and O/E are again well standardized. But F1 is much less well
standardized in Figure 2. In contrast, the results for S and T4 have improved
considerably. The standardization of T2 was better in Figure 1.

Finally, Figure 3 presents the results comparing the low θ normals to the
high θ normals. The pattern of results indicates that this comparison is the
most severe test of standardization. Note that at low misclassification
rates, only z3, F2, and JK have ROC curves near the diagonal. The
standardizations of NI, IOV, F1, B, and S are all poor. T4 seems standardized
somewhat better than T2.


Power 


Problem 


Do  any  of  the  well-standardized  practical  appropriateness  indices  have 
adequate  power  for  detecting  some  form  of  aberrance?  Are  any  nearly  as 
powerful  as  the  index  that  is  optimal  for  the  given  form  of  aberrance? 

Method 

Data  Sets.  A  test  norming  sample  of  3,000  response  vectors  was  created 
by sampling 3,000 numbers (θs) from the normal (0,1) distribution truncated to
the  [-2.05,2.05]  interval.  A  normal  sample  of  4,000  response  vectors  was  also 
generated  in  this  way.  Two  thousand  aberrant  response  vectors  were  created  in 
each  of  12  conditions.  The  12  conditions  resulted  from  varying  three  factors: 
the  type  of  aberrance  (spuriously  high;  spuriously  low),  the  severity  of 
aberrance  (mild;  moderate),  and  the  distribution  from  which  simulated 
abilities  were  sampled. 

Figure 2. ROC curves obtained from 200 normal average θ response
vectors and 200 normal high θ response vectors.

Six of the aberrant samples contained spuriously high response vectors,
and the remaining six samples contained spuriously low response vectors.
Spuriously high response patterns were created by first generating normal
response vectors (polychotomously scored) and then replacing a given
percentage k of simulated responses (randomly sampled without replacement)
with correct responses. Spuriously low response patterns were also created by
first generating normal response vectors. Then a fixed percentage k of items
was randomly selected without replacement, and the responses to these items
were replaced with random responses (i.e., a response was replaced by option A
with probability .2, by option B with probability .2, ..., and by option E
with probability .2). Mildly aberrant response patterns were generated by
using k = 15%. Moderately aberrant response patterns were created using k =
30%.
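The two aberrance manipulations can be sketched as below. The function names and the placeholder response strings are hypothetical, and rounding k times the test length to the nearest whole item is an assumption (the report states only that the 15% treatment affected at most 13 items on the 85-item test, which this rounding reproduces).

```python
import random


def make_spuriously_high(responses, correct, k, rng):
    """Replace a fraction k of responses (chosen without replacement)
    with the correct option for those items."""
    n_change = round(k * len(responses))  # assumption: round to nearest item
    out = list(responses)
    for i in rng.sample(range(len(responses)), n_change):
        out[i] = correct[i]
    return out


def make_spuriously_low(responses, k, rng, options="ABCDE"):
    """Replace a fraction k of responses (chosen without replacement)
    with an option drawn uniformly from A-E (probability .2 each)."""
    n_change = round(k * len(responses))  # assumption: round to nearest item
    out = list(responses)
    for i in rng.sample(range(len(responses)), n_change):
        out[i] = rng.choice(options)
    return out


rng = random.Random(42)
n_items = 85
normal = ["X"] * n_items        # placeholder responses (e.g., all omitted)
answer_key = ["A"] * n_items    # hypothetical answer key

high = make_spuriously_high(normal, answer_key, 0.15, rng)
low = make_spuriously_low(normal, 0.15, rng)
```

Note that a spuriously high replacement may leave a response unchanged when the simulated examinee had already answered that item correctly, which is why the text speaks of "at most" 13 affected items.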

The  third  variable  manipulated  was  the  ability  level  of  the  aberrant 
sample.  Abilities  for  the  spuriously  high  samples  were  sampled  from  three 
parts  of  the  normal  (0,1)  distribution  truncated  to  [-2.05,2.05]:  very  low 
(0th  through  9th  percentiles),  low  (10th  through  30th  percentiles),  and  low 
average  (31st  through  48th  percentiles).  In  all  cases,  percentile  points  were 
determined after the truncation to [-2.05, 2.05]. These intervals were used
because  it  is  more  important  to  detect  spuriously  high  response  patterns  for 
low  ability  examinees  than  for  high  ability  examinees.  Similarly,  it  is  more 
important  to  detect  spuriously  low  responses  for  high  ability  examinees. 
Consequently,  abilities  were  sampled  from  three  above-average  ability  strata 
for  the  spuriously  low  samples:  very  high  (93rd  percentile  and  above),  high 
(65th  through  92nd  percentiles),  and  high  average  (49th  through  64th 
percentiles).  The  ability  percentiles  used  here  correspond  to  the  percentiles 
forming  AFQT  categories. 

Table  1  summarizes  the  12  samples  of  aberrant  response  vectors.  Each  of 
these  24,000  (=12  *  2,000)  response  vectors  was  independently  generated. 

Analysis.  All  the  item  and  test  statistics  required  to  compute  the 
practical  appropriateness  indices  were  computed  using  the  test  norming  sample. 
These  quantities  were  computed  as  the  first  step  in  the  analysis  and  then  used 
in all subsequent analyses. LOGIST (Wood, Wingersky, & Lord, 1976) was used to
estimate  item  parameters  and  a  FORTRAN  program  was  written  to  compute  the 
other  quantities  required. 

Then  the  11  practical  appropriateness  indices  were  computed  for  the  4,000 
response  vectors  in  the  normal  (responding  appropriately)  sample.  The  item 
and  test  statistics  estimated  from  the  test  norming  sample  were  used  in  these 
calculations.  This  procedure  simulates  the  process  by  which  practical 
appropriateness  indices  would  be  computed  in  many  applications.  Four  optimal 
indices  were  also  computed  for  the  normal  sample:  15%  spuriously  high,  30% 
spuriously  high,  15%  spuriously  low,  and  30%  spuriously  low.  The  ability 
density  f  used  in  Equations  1  and  2  was  the  normal  (0,1)  density  truncated  to 
the  interval  [-2.05,  2.05].  The  histograms  used  to  generate  the  data  were 
also  used  to  compute  the  optimal  indices;  that  is,  polychotomous  option 
characteristic  curves  were  not  estimated.  (In  order  for  an  optimal  index  to 
be  truly  optimal  for  the  corresponding  form  of  aberrance,  it  is  necessary  to 
use  the  true  option  characteristic  curves.) 


Table 1. Ability Distributions Used to Generate Aberrant Samples

Percent of                      Type of aberrance
aberrant responses      Spuriously high       Spuriously low

15%                     NT[-2.05,-1.34]       NT(1.41,2.05]
15%                     NT(-1.34,-0.52]       NT(0.35,1.41]
15%                     NT(-0.52,-0.05]       NT(-0.05,0.35]
30%                     NT[-2.05,-1.34]       NT(1.41,2.05]
30%                     NT(-1.34,-0.52]       NT(0.35,1.41]
30%                     NT(-0.52,-0.05]       NT(-0.05,0.35]

Note. NT(a,b] is used to denote the standard normal distribution
truncated to the interval (a,b]. Parentheses are used to indicate interval
endpoints that were not included in the interval and brackets are used to
indicate interval endpoints that were included in the interval.

The 11 practical appropriateness indices were computed for each of the 12
aberrant samples. In addition, the 15% spuriously high optimal index was
computed for the three samples with this form of aberrance; the 30% spuriously
high optimal index was computed for the three samples with this form of
aberrance; etc.


Note  that  the  ability  density  used  in  Equations  1  and  2  does  not  match 
the  ability  density  of  any  aberrant  sample.  The  proper  interpretation  of  the 
optimal  index  is  the  following:  It  is  the  optimal  index  for  the  specified 
form  of  aberrance,  say  15%  spuriously  high,  in  a  population  where  the  ability 
density  is  normal  (0,1)  truncated  to  [-2.05,  2.05]  for  both  the  normal  and 
aberrant  populations  and  a  response  vector  is  either  normal  or  15%  spuriously 
high.  The  normal  group  does  in  fact  have  this  ability  distribution.  By 
restricting  the  abilities  of  the  aberrant  group  to  a  subinterval  of  [-2.05, 
2.05],  we  determined  the  power  in  a  particular  subpopulation  of  the  index  that 
is  optimal  for  the  population  as  a  whole. 


Evaluation  Criteria.  The  main  criteria  used  for  evaluating  the 
appropriateness  indices  were  the  proportions  of  aberrant  response  patterns 
that  were  correctly  identified  as  aberrant  when  various  proportions  of  normal 
response  patterns  were  misclassified  as  aberrant.  These  proportions  were 
determined  for  all  12  aberrance  conditions.  This  allowed  us  to  determine  what 
types  of  aberrant  response  patterns  had  acceptably  high  detection  rates  using 
optimal  methods  and  using  practical  methods.  The  characteristics  of  response 
patterns  that  cannot  be  detected  became  evident  as  a  consequence  of  examining 
the  12  aberrance  conditions  separately. 
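The evaluation criterion (the detection rate at a fixed false alarm rate) can be sketched as follows. The function name and the toy scores are hypothetical; the sketch assumes, as in the ROC discussion, that small index values indicate aberrance, and it sets the cutting score from the empirical distribution of the normal sample.

```python
import math


def detection_rate(normal_scores, aberrant_scores, false_alarm_rate):
    """Proportion of aberrant scores below the cutting score that
    misclassifies the given proportion of the normal sample."""
    ordered = sorted(normal_scores)
    # Number of normal response patterns allowed to fall below the cutoff.
    n_flagged = math.floor(false_alarm_rate * len(ordered) + 1e-9)
    # Scores strictly below the cutoff are classified as aberrant.
    cutoff = ordered[n_flagged] if n_flagged < len(ordered) else math.inf
    hits = sum(score < cutoff for score in aberrant_scores)
    return hits / len(aberrant_scores)


# Hypothetical index scores.
normal_scores = [0, 1, 2, 3, 4, 5, 6, 7]
aberrant_scores = [-1, 0.5, 1.5, 3]

rate_25 = detection_rate(normal_scores, aberrant_scores, 0.25)
rate_0 = detection_rate(normal_scores, aberrant_scores, 0.0)
```

With tied normal scores, fewer normals than requested may fall strictly below the cutoff, so the realized false alarm rate never exceeds the nominal one in this sketch.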

Results 


Before presenting the results for the 12 aberrant samples, we shall
illustrate some problems caused by poorly standardized appropriateness
indices. Table 2 presents detection rates for the 15% spuriously high
aberrant sample for two different samples of normal responses. In one case,
the normal sample consists of the 200 response vectors with the highest θ
values from the normal sample of N = 4,000 previously described; in the other
case, the normal sample consists of the 200 response vectors with the lowest θ
values. (Results for B are not given because this index was not programmed in
its final form when this table was constructed.)

As shown in Table 2, the IOV index seems to be fantastic when the normal
group consists of high ability normals: It correctly identified every single
aberrant response vector, without a single misclassification of a normal. The
S index appeared to be an excellent index, although not as powerful as IOV.
In contrast, F1 seemed to be an abysmally poor index.


These results were almost completely contradicted for the low ability
normals. At a 1% false alarm rate, the detection rate of the IOV index was
10% when the normal group consisted of low ability response patterns; it was
100% when the normals were high ability. The comparable rates for S were 78%
and 8%, respectively. The results for F1 were in the opposite direction: The
detection rate was 0% for normals of high ability but 34% for normals of low
ability.

Table 2. Selected ROC Curve Points for the 15% Spuriously High Treatment,
Aberrant Response Patterns Generated from the 0-9% Ability Range

[Table entries not recoverable. Columns: false alarm rate; normal group;
proportion detected by each index.]

Note. z3 = standardized ℓ0; F1 = fit statistic 1; F2 = fit statistic
2; T2 = second standardized extended caution index; T4 = fourth standardized
extended caution index; IOV = item-option variance; O/E = observed
information divided by expected information; JK = normalized jackknife
estimate of variance; NI = number of Newton-Raphson iterations.



The differences in detection rates for F1, S, and IOV resulted from their
poor standardizations. In contrast, the well-standardized z3 had detection
rates of 47% and 44% at a 1% misclassification rate. F2 also had similar
detection rates: 34% and 36%. T2 is not standardized as well as T4; however,
the detection rates for T2 were higher than the rates for T4. O/E and JK had
moderately dissimilar detection rates across the two sets of normals.
Finally, the detection rates for NI were identical across conditions;
unfortunately, the detection rates were exceedingly poor.

The results for the 15% and 30% spuriously high samples for the low
ability range (0th through 9th percentiles) are shown in Table 3. In this
case, the normal group consisted of 4,000 response vectors that were generated
from θ values sampled from the standard normal distribution truncated to
[-2.05,2.05]. Note that the detection rates for z3, F2, and T2 were fairly
close to the rates for LR. It is clear from Table 3 that the 30% spuriously
high treatment is very detectable: LR, z3, and T2 all had detection rates of
90% or more when the error rate was 1%. Even the relatively mild 15%
spuriously high treatment (which affected at most 13 items on the 85-item
test) was fairly detectable: LR and z3 had detection rates of 50% and 46% at a
1% error rate. O/E and JK, which were shown to be well standardized in the
previous section of this paper, had little power. At a 1% error rate, O/E and
JK detected only 22% and 33% of the 30% spuriously high response vectors.

Table 4 presents the results for the 15% and 30% spuriously high
treatment applied to the moderately low ability range (10th through 30th
percentiles). It should be more difficult to detect aberrant response vectors
in this ability range than in the low ability range because the expected
number of responses changed due to the aberrance manipulation is smaller.
Surprisingly, the detection rates for LR did not decrease sharply: At a 1%
error rate, the detection rates were 50% versus 45% for 15% spuriously high,
and 93% versus 89% for 30% spuriously high. The detection rates declined more
rapidly for z3 (46% vs. 30% for 15% spuriously high; 90% vs. 75% for 30%
spuriously high) and F2 (34% vs. 21%; 85% vs. 73%). The rates of decline of
T2 and T4 were intermediate. T2 declined from 37% to 33% for 15% spuriously
high and from 91% to 31% for the 30% treatment. T4 declined from 30% to 25%
and from 87% to 79%.

The trends seen in Tables 3 and 4 continue in Table 5, which presents the
results for the 15% and 30% spuriously high treatments applied to the low
average ability range (31st to 48th percentiles). As shown in Table 5, the LR
index provided detection rates that are roughly 50% higher than those of the
best practical indices. For example, at a 1% error rate LR had a detection
rate of 34% for the 15% treatment; z3, F2, T2, and T4 had detection rates of
18%, 15%, 23%, and 20%, respectively. The detection rates were 78% versus 51%,
53%, 51%, and 57% for the 30% spuriously high condition at a 1% error rate.

Table 6 presents the results for the 15% and 30% spuriously low treatment
applied to the high average ability sample (between the 49th and 64th
percentiles). It is evident that the practical appropriateness indices are
quite ineffective relative to the optimal index. At a 1% error rate, LR had a
47% detection rate for the 15% treatment; the highest rate for any of the
practical indices was only 16%. The pattern of results for the 30% condition
was similar. Here the LR detection rate was an impressive 79% when the error
rate was 1%; the next best index (z3) detected only 35% of the aberrant
sample.

Table 4. Selected ROC Curve Points for the Aberrant
Response Patterns Generated from the 10-30% Ability Range

Table 5. Selected ROC Curve Points for the Aberrant
Response Patterns Generated from the 31-48% Ability Range

Table 6. Selected ROC Curve Points for the Aberrant
Response Patterns Generated from the 49-64% Ability Range

Table 7. Selected ROC Curve Points for the Aberrant
Response Patterns Generated from the 65-92% Ability Range

[Table entries not recoverable. Columns in each table: false alarm rate;
proportion detected by LR, z3, F1, F2, S, T2, T4, IOV, O/E, and JK.]

In Table 7, which presents the results for the 15% and 30% spuriously low
samples with θs in the 65th through 92nd percentiles, the practical
appropriateness indices have detection rates that are closer to the rates of
the optimal index. This trend is continued in Table 8, which presents the
results for the spuriously low treatments applied to the highest ability
category (percentiles 93 and above). At a 1% error rate, for example, LR
detected 81% of the 15% spuriously low response patterns; z3, F2, and T2 had
detection rates of 72%, 62%, and 54%. For the 30% treatment, the rate for LR
was 97%; z3, F2, and T2 had rates of 95%, 91%, and 94%.

Drasgow  and  Guertler  (1987)  recently  presented  a  utility  theory  approach 
to  the  use  of  Appropriateness  Measurement  in  practical  settings.  Their 
approach  requires  the  densities  of  an  index  in  normal  and  aberrant  samples. 
Consequently, normal distributions were fitted to the distributions of z3, F2,
and  T4  by  equating  the  first  two  moments  of  the  normal  distribution  to  the 
empirical  moments.  These  analyses  were  based  on  the  first  1,000  response 
vectors  from  the  normal  sample  and  each  of  the  12  aberrant  samples.  The 
fitted  means  and  standard  deviations  are  presented  in  Table  9.  As  a  crude 
measure  of  fit,  Kolmogorov-Smirnov  test  statistics  were  computed  to  compare 
the  empirical  distributions  to  normal  distributions  with  the  observed  moments. 
No significant (α = .05) departures of empirical distributions from the
corresponding fitted normal distribution were found. As the Kolmogorov-Smirnov
test can be conservative when fitted moments are substituted into the
theoretical distribution (Massey, 1951), these results should be viewed with
some caution.
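The moment-matching fit and the Kolmogorov-Smirnov comparison can be sketched as follows. This is an illustrative sketch using only the Python standard library; the function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the report does not say which estimator was used. The sketch returns the test statistic only; because the moments are estimated from the same data, standard critical values would be conservative, as the text notes.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal(mu, sigma) law."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic_fitted_normal(scores):
    """Fit a normal distribution by matching the first two moments and
    return (mean, sd, D), where D is the Kolmogorov-Smirnov statistic."""
    mu = statistics.fmean(scores)
    sigma = statistics.stdev(scores)  # assumption: sample (n - 1) estimator
    ordered = sorted(scores)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i + 1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return mu, sigma, d


# Hypothetical index scores.
mu, sigma, d = ks_statistic_fitted_normal([-1.0, 0.0, 1.0])
```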

Discussion 

There  has  been  a  growing  interest  in  Appropriateness  Measurement,  both  by 
researchers  and  by  testing  practitioners.  To  date,  however,  there  has  been 
little  critical  study  of  the  various  indices  available.  The  results  of  the 
research summarized here clearly indicate that there are important differences
in the properties of appropriateness indices. Figures 1 through 3 show that
some indices are poorly standardized (e.g., IOV), and a "standardized" index
may not be well standardized (e.g., F1). Table 2 illustrates the problems
that are caused by poorly standardized indices.

A well-standardized index is not, however, necessarily a good
appropriateness index. The O/E and JK indices were shown to be reasonably
well standardized in Figures 1 through 3, but Tables 3 through 8 clearly show
them to be ineffective in detecting aberrant response patterns.

Perhaps the most important finding of the simulation reported in this
chapter is that z3, F2, and T2 provide nearly optimal rates of detection of
some forms of aberrance but inadequate rates of detection of other forms of
aberrance. In particular, these three indices have near-optimal rates of
detection when the spuriously high treatment is applied to very low ability
response vectors and when the spuriously low treatment is applied to very high
ability response vectors. Unfortunately, these indices have rates of
detection far below optimal when the spuriously high and low treatments are
applied to response vectors with nearly average ability values.


Table 9. Means and Standard Deviations of Empirical
Distributions of z3, F2, and T4

                                      Severity of aberrance
                                  15%                      30%
Aberrance      Ability
manipulation   range        z3      F2      T4       z3      F2      T4

Spur. High     0-9%       -2.32    1.28    1.56    -4.00     --     3.22
                          (1.13)  (0.14)  (0.94)   (1.22)          (1.07)
Spur. High     10-30%     -1.85    1.23    1.39    -3.32     --     3.04
                          (1.11)  (0.14)  (0.98)   (1.19)          (1.10)
Spur. High     31-48%     -1.38    1.19    1.22    -2.47     --     2.38
                          (1.03)  (0.14)  (1.02)   (1.21)          (1.19)
Spur. Low      49-64%     -1.02    1.13    0.65    -1.58     --     1.20
                          (1.03)  (0.13)  (0.99)   (1.14)          (0.98)
Spur. Low      65-92%     -1.85    1.23    1.17    -2.74     --     2.12
                          (1.16)  (0.16)  (1.11)   (1.19)          (1.08)
Spur. Low      93-100%    -3.01    1.37    1.78    -4.28     --     3.50
                          (1.30)  (0.17)  (1.14)   (1.32)          (1.24)
Normals(a)     0-100%      0.09    0.99   -0.14
                          (0.97)  (0.12)  (0.86)

Note. Means and standard deviations are based on samples of N = 1,000.
Standard deviations are in parentheses. Entries marked -- (F2 at 30%
severity) are missing from the source text.

(a) To conserve space, results for the normal sample are listed under the
columns for the 15% severity of aberrance.


These  results  indicate  that  we  need  to  devise  new  indices  that  are  more 
powerful than z3, F2, and T2 for examinees whose abilities are near average.
We  expect  that  it  may  be  necessary  to  construct  two  indices:  one  for 
spuriously  low  response  patterns  and  one  for  spuriously  high  response 
patterns.  This  psychometric  necessity  would  be  quite  useful  for  practitioners 
because  it  would  allow  them  to  diagnose  the  cause  of  aberrance  in  addition  to 
detecting  aberrant  response  patterns. 


III.  POLYCHOTOMOUS  ANALYSIS  OF  THE  ARITHMETIC  REASONING  TEST: 
AN  APPLICATION  OF  MULTILINEAR  FORMULA  SCORE  THEORY 


Introduction 

Multilinear  formula  score  theory  or  multilinear  formula  scoring  (MFS; 
Levine, 1983, 1985a, 1985b) is a nonparametric IRT for which consistent and
asymptotically  efficient  estimators  of  ability  densities,  item  characteristic 
curves  (ICCs),  and  option  characteristic  curves  (OCCs)  have  been  derived  and 
programmed.  MFS  provides  a  powerful  new  approach  to  substantive  questions  of 
long  standing.  These  questions  include  determining  the  shapes  of  ability 
distributions  and  the  magnitudes  of  differences  among  ability  distributions  of 
various  groups,  determining  the  shapes  of  item  characteristic  curves  for 
unidimensional  and  multidimensional  tests,  identifying  biased  and  other  faulty 
items,  and  assessing  the  extent  to  which  two  tests  measure  the  same  ability. 

In the research reported in this chapter, we used three-parameter logistic
ICCs  to  model  the  way  in  which  examinees  respond  to  correct  options  of  AR 
multiple-choice  items  and,  simultaneously,  we  used  MFS  to  model  responses  to 
the  incorrect  options.  Thus,  we  replaced  the  crude  "histogram  model"  of 
Chapter  II  with  a  theory-based  approach.  Consequently,  low  rates  of  detection 
of  inappropriate  response  patterns  cannot  be  attributed  to  an  unsophisticated 
analysis  of  the  data. 

Prior  to  determining  rates  of  detection  of  spuriously  high  and  low 
response  patterns,  we  examined  MFS's  ability  to  estimate  option  response 
curves.  The  results  of  this  analysis  were  assessed  graphically  and  by 
determining  the  increase  in  information  about  ability  due  to  polychotomous 
scoring  of  item  responses.  The  term  "information"  is  used  in  its  statistical 
sense  to  mean  the  expected  squared  derivative  of  the  logarithm  of  the 
likelihood function. Since the asymptotic standard error of the maximum
likelihood estimate of an ability θ equals the square root of the reciprocal
of the information function at θ, an increase in information due to
polychotomous scoring is readily translated into percent test length reduction
made possible by polychotomous scoring.
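The translation from information to test length can be illustrated with a short sketch. It rests on an assumption not spelled out in the text: that information is additive over items and that a shortened test would consist of items with the same average information, so that the test length needed to reach a given standard error is inversely proportional to information per item. The function names and the numerical values are hypothetical.

```python
import math


def asymptotic_se(information):
    """Asymptotic standard error of the ability estimate: 1 / sqrt(I)."""
    return 1.0 / math.sqrt(information)


def percent_length_reduction(info_dichotomous, info_polychotomous):
    """Percent by which a polychotomously scored test could be shortened
    while keeping the information of the dichotomously scored full test
    (assumes information is proportional to test length)."""
    return 100.0 * (1.0 - info_dichotomous / info_polychotomous)


# Hypothetical information values at one ability level.
se_dichotomous = asymptotic_se(4.0)
reduction = percent_length_reduction(8.0, 10.0)
```

In this hypothetical case, a 25% gain in information (from 8 to 10) corresponds to a 20% shorter test at the same precision.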

We  also  compared  the  dichotomous  and  polychotomous  item  response  models' 
potentials  for  supporting  Appropriateness  Measurement.  Of  course,  the  model- 
based  detectability  of  a  particular  type  of  aberrance  depends  upon  the  item 
response  model  used  to  analyze  the  data;  more  specific  (polychotomous)  models 
are  expected  to  be  rejected  more  frequently  when  fitted  to  aberrant  response 
patterns  and  thus  provide  superior  appropriateness  measurement.  By  combining 
the optimal appropriateness index results of Levine and Drasgow (1989, 1987)
with MFS's ability to accurately recover the option characteristic curves
needed for polychotomous modeling, we determined whether polychotomous
modeling was negligibly or markedly superior to dichotomous modeling in
detecting test anomalies.

This  chapter  also  contributes  to  formula  score  theory  in  that  it  provides 
a  verification  of  MFS  theoretical  results  with  simulation  data. 


Review  of  Multilinear  Formula  Score  Theory 


This  section  contains  a  review  of  MFS  theory  as  it  is  used  in  this  paper. 
The  theory  is  more  general  than  outlined  here,  but  for  the  sake  of  clarity,  we 
will  describe  only  the  special  case  required  for  the  present  research. 

Let $u_i$ denote the response to the ith item of an n-item test, scored
$u_i = 1$ if correct and $u_i = 0$ if incorrect. The $u_i$ generate the
elementary formula scores, which can be enumerated as

$$
\begin{array}{l}
u_1,\ u_2,\ \ldots,\ u_n \\
u_1 u_2,\ u_1 u_3,\ \ldots,\ u_{n-1} u_n \\
\quad\vdots \\
u_1 u_2 \cdots u_n .
\end{array}
$$


Traditional formula scoring (Lord & Novick, 1968, Chapter 14) generally
uses only linear scores. When there is neither omitting nor polychotomous
scoring, linear formula scores are formulas with a constant term plus a linear
combination of the binary item scores, $u_1, u_2, \ldots, u_n$. (When there is
omitting and polychotomous scoring, a linear score is a constant plus a linear
combination of binary variables indicating omitting and option choice.)


Multilinear formula score theory generalizes traditional formula score
theory by using quadratic scores (linear scores added to linear combinations
of $u_1 u_2, u_1 u_3, \ldots, u_{n-1} u_n$), cubic scores (quadratic scores
plus linear combinations of products of item scores for three different
items), and higher order scores. Most of the results in this chapter were
obtained with fifth order scores. The new theory is called "multilinear"
because frequent use is made of the fact that when all the scores except one
are held constant, a "linear" score is obtained.
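The enumeration of elementary formula scores up to a given order can be sketched as follows. The function name and the three-item response vector are hypothetical; the sketch simply forms the product of binary item scores over every subset of one, two, ..., `max_order` distinct items.

```python
from itertools import combinations
from math import prod


def elementary_formula_scores(u, max_order=None):
    """Map each subset of item indices (as a tuple) to the product of the
    corresponding binary item scores, for subsets of size 1..max_order."""
    n = len(u)
    if max_order is None:
        max_order = n
    scores = {}
    for order in range(1, max_order + 1):
        for items in combinations(range(n), order):
            scores[items] = prod(u[i] for i in items)
    return scores


# Hypothetical three-item response vector: items 1 and 3 correct.
u = [1, 0, 1]
all_scores = elementary_formula_scores(u)
up_to_quadratic = elementary_formula_scores(u, max_order=2)
```

For n items the full enumeration has 2^n - 1 entries, which is why an analysis restricted to, say, fifth order scores is far more tractable than one using every product.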

In this chapter, as in Chapter II, we assume that the regression of $u_i$ on
the latent trait θ is a three-parameter logistic ogive. By local
independence, the regressions of the elementary formula scores on the latent
trait can then be written as

$$
\begin{array}{l}
P_1(t),\ P_2(t),\ \ldots,\ P_n(t) \\
P_1(t)P_2(t),\ P_1(t)P_3(t),\ \ldots,\ P_{n-1}(t)P_n(t) \\
\quad\vdots \\
P_1(t)P_2(t) \cdots P_n(t)
\end{array}
$$

where t is used to denote a specific value of θ.


There are $2^n - 1$ regression functions listed above. More can be generated by
taking  linear  combinations  of  the  elementary  formula  scores  and  then  computing 
their  regressions  on  the  latent  trait.  For  example,  the  number-right  score 

$$X = u_1 + u_2 + \cdots + u_n$$

has the regression

$$E(X \mid t) = \sum_{i=1}^{n} P_i(t).$$
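As a minimal sketch of these regressions, the Python fragment below evaluates a three-parameter logistic ICC, the regression of a product of item scores, and the regression of the number-right score. The item parameters are hypothetical, and the scaling constant `D = 1.7` is an assumption about the parameterization rather than something stated in the text.

```python
import math


def icc_3pl(t, a, b, c, D=1.7):
    """Three-parameter logistic ICC: probability of a correct response at t."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (t - b)))


# Hypothetical (a, b, c) parameters for a 3-item test.
items = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (0.8, 1.0, 0.25)]


def regression_of_product(t, subset):
    """E(product of u_i for i in subset | t), using local independence."""
    return math.prod(icc_3pl(t, *items[i]) for i in subset)


def expected_number_right(t):
    """E(X | t): the sum of the item characteristic curves at t."""
    return sum(icc_3pl(t, a, b, c) for a, b, c in items)


t = 0.0
print(regression_of_product(t, (0, 1)))  # regression of u_1 * u_2
print(expected_number_right(t))          # regression of the number-right score
```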

The  collection  of  regression  functions  of  all  linear  combinations  of 
elementary  formula  scores  is  called  the  canonical  space  (CS)  of  a  test. 

A  major  step  in  an  MFS  analysis  of  a  test  consists  of  finding  a  smaller 
number  of  functions  to  represent  the  large  number  (in  fact,  an  infinite 
number)  of  functions  in  the  canonical  space.  The  smaller  collection  of 
functions  is  called  an  orthonormal  basis  for  the  canonical  space. 

Selecting  an  orthonormal  basis  for  the  canonical  space  is  analogous  to 
finding  the  principal  components  of  a  set  of  variables.  In  a  principal 
components  analysis,  the  basic  idea  is  to  create  a  new  set  of  variables,  the 
principal  components,  so  that  each  of  the  original  variables  can  be  written  as 
a  linear  combination  of  the  principal  components  plus  a  small  residual.  A 
principal  components  analysis  is  valuable  when  there  is  a  large  number  of 
original  variables  and  the  first  few  principal  components  explain  almost  all 
of  their  variance.  In  the  same  way,  functions  in  the  canonical  space  are 
written  as  linear  combinations  of  the  orthonormal  basis  functions.  For 
example, the ICC for the $i$th item can be written

$$P_i(t) = \sum_{k=1}^{K} a_k h_k(t),$$

where $K$ functions, denoted $h_1(t), \ldots, h_K(t)$, are used in the orthonormal
basis and the $a_k$ are the weights used in the linear combination. If $K$ is
sufficiently large, this representation is exact. If only the first $J$
functions are used, instead of all $K$ functions (where $J$ is less than $K$), then
there is some error. However, the residual

$$P_i(t) - \sum_{k=1}^{J} a_k h_k(t)$$

will be small if the $a_k$ are small for values of $k$ larger than $J$. In fact, the
area under the squared residual is exactly $a_{J+1}^2 + a_{J+2}^2 + \cdots + a_K^2$.
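The truncation identity can be checked numerically. The following is a minimal sketch under stated assumptions: an unweighted inner product on [-4, 4] approximated by a Riemann sum, hypothetical item parameters, a scaling constant `D = 1.7`, and a basis built by Gram-Schmidt orthonormalization of four ICCs. The MFS programs may construct their basis differently.

```python
import math


def icc_3pl(t, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (t - b)))


# Grid on [-4, 4]; inner(f, g) approximates the integral of f * g.
M = 401
step = 8.0 / (M - 1)
grid = [-4.0 + m * step for m in range(M)]


def inner(f, g):
    return sum(x * y for x, y in zip(f, g)) * step


# Hypothetical item parameters; their ICCs span a small function space.
params = [(1.0, -1.5, 0.2), (1.4, -0.5, 0.2), (0.9, 0.5, 0.2), (1.2, 1.5, 0.2)]
curves = [[icc_3pl(t, *p) for t in grid] for p in params]

# Modified Gram-Schmidt: orthonormal h_1, ..., h_K from the curves.
basis = []
for f in curves:
    r = list(f)
    for h in basis:
        w = inner(r, h)
        r = [x - w * y for x, y in zip(r, h)]
    norm = math.sqrt(inner(r, r))
    basis.append([x / norm for x in r])

K, J = len(basis), 2
target = curves[-1]                    # ICC of the last item
a = [inner(target, h) for h in basis]  # weights a_1, ..., a_K

# Keep only the first J basis functions and measure the residual.
approx = [sum(a[k] * basis[k][m] for k in range(J)) for m in range(M)]
resid = [x - y for x, y in zip(target, approx)]

print(inner(resid, resid))           # area under the squared residual
print(sum(w * w for w in a[J:]))     # sum of the squared dropped weights
```

The two printed quantities agree up to rounding error because the basis functions are orthonormal under the same inner product that defines the weights.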

In  each  MFS  analysis,  a  parsimonious  representation  of  one  or  another 
collection  of  functions  in  the  CS  is  important.  MFS  provides  techniques  that 
yield basis functions that give small values of $a_k$ for large values of $k$, at
least for the collection of functions being analyzed. Most MFS analyses
require  six  to  eight  basis  functions  for  an  adequate  representation  of  the 
functions  being  studied. 

To  recapitulate,  the  analysis  begins  by  estimating  ICCs  from  the 
dichotomously  scored  item  responses.  Widely  available  programs  such  as  LOGIST 
(Wood,  Wingersky,  &  Lord,  1976)  and  BILOG  (Mislevy  &  Bock,  1983)  can  be  used 
to  this  end.  The  estimated  ICCs  and  the  assumption  of  local  independence  are 
subsequently  used  to  define  the  canonical  space.  Then  a  small  number  of 
orthonormal  basis  functions  are  selected  so  that  the  functions  in  the 
canonical  space  are  well  approximated  by  linear  combinations  of  the 
orthonormal  basis  functions. 


The  next  step  of  the  MFS  analysis  is  to  determine  weights  for  the 
orthonormal  basis  functions  so  that  option  characteristic  curves  (OCCs)  can  be 
written as linear combinations of the $h_k$'s. Since OCCs were not included in
the set of functions used to define the canonical space, we must address both
the mathematical question of how best to approximate the OCCs by basis
functions and the substantive question of whether or not the basis functions
can adequately approximate OCCs. The OCC analysis proceeds item by item, with
the  weights  for  all  the  options  (including  omit  as  an  option)  to  each  item 
simultaneously  estimated  by  the  method  of  marginal  maximum  likelihood.  The 
log likelihood that is maximized with respect to the weights is

$$L = \sum_{j=1}^{N} \log P(\mathbf{u}_j, v_{ij}), \qquad (17)$$

where $\mathbf{u}_j$ is a vector containing the dichotomously scored item responses of the
$j$th examinee and $v_{ij}$ indicates the particular option on item $i$ selected by
examinee $j$. For a four-option multiple-choice item, $v_{ij} = 1$ if option A is
selected, ..., $v_{ij} = 4$ if option D is selected, and $v_{ij} = 5$ if no response is
made. Suppose all the items are recoded such that option A is always the
correct response. Then Equation 17 can be rewritten as

$$L = \sum_{j:\, v_{ij} = 1} \log P(\mathbf{u}_j) + \sum_{j:\, v_{ij} \neq 1} \log \int P(\mathbf{u}_j \mid t)\, P(v_{ij} \mid t, u_{ij} = 0)\, f(t)\, dt, \qquad (18)$$

where

$$P(\mathbf{u}_j \mid t) = \prod_{i=1}^{n} P_i(t)^{u_{ij}} [1 - P_i(t)]^{1 - u_{ij}}, \qquad (19)$$

$$P(v_{ij} \mid t, u_{ij} = 0) = \sum_{k=1}^{J} a_k h_k(t), \qquad (20)$$

and $f(t)$ is the ability density. Notice that Equation 19 is the likelihood
function for the three-parameter logistic model (i.e., Lord's (1980) Equation
4-20 and Hulin et al.'s (1983) Equation 2.6.2). It is the $a_k$'s in Equation 20
that are to be estimated. Actually, each option has its own set of $J$ $a_k$'s, but
to avoid notational complexity, we have not added another subscript to the
$a_k$'s.

It  is  important  to  observe  that  local  independence  is  not  used  to  derive 
Equation  18  from  Equation  17;  only  the  definition  of  conditional  probability 
is  used.  Thus,  even  when  skipping  items  or  not  reaching  items  (response  "5") 
fails  to  obey  the  assumption  of  local  independence,  an  accurate  estimate  of 
the  conditional  probability  of  non-response  for  examinees  at  each  ability 
level  is  obtained. 
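A minimal numerical sketch of this marginal likelihood for one studied item follows. It is an illustration under stated assumptions, not the estimation program used in the research: the item parameters are hypothetical, the ability density is taken to be standard normal, the integral is replaced by a Riemann sum, and the conditional option probabilities come from a stand-in function `coccs` rather than from a weighted sum of orthonormal basis functions.

```python
import math


def icc_3pl(t, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (t - b)))


items = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (0.8, 1.0, 0.25)]  # hypothetical


def coccs(t):
    """Stand-in conditional option probabilities for options 2-5 (sum to 1)."""
    w = [math.exp(-0.5 * t), 1.0, math.exp(0.5 * t), 0.5]
    s = sum(w)
    return {v: x / s for v, x in zip((2, 3, 4, 5), w)}


# Quadrature grid on [-5, 5] and an assumed standard normal density f(t).
step = 0.05
grid = [-5.0 + step * m for m in range(201)]
dens = [math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi) for t in grid]


def pattern_prob(u, t):
    """P(u | t) for a 0/1 pattern u, using local independence across items."""
    p = 1.0
    for u_i, prm in zip(u, items):
        P = icc_3pl(t, *prm)
        p *= P if u_i == 1 else 1.0 - P
    return p


def joint_prob(u, v, i=0):
    """P(u, v) for studied item i; v = 1 denotes the (recoded) correct option."""
    total = 0.0
    for t, f in zip(grid, dens):
        term = pattern_prob(u, t) * f
        if u[i] == 0:
            term *= coccs(t)[v]
        total += term * step
    return total


def log_likelihood(data):
    """Sum of log P(u_j, v_ij) over examinees, given (u, v) pairs."""
    return sum(math.log(joint_prob(u, v)) for u, v in data)


data = [([1, 1, 0], 1), ([0, 1, 0], 3), ([0, 0, 0], 5)]
print(log_likelihood(data))
```

In an actual analysis the stand-in `coccs` would be replaced by the basis-function expansion, and the weights would be chosen to maximize `log_likelihood` subject to the constraints described in the text.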

Quadratic programming methods are used to obtain maximum likelihood
estimates of orthonormal basis function weights for conditional option
characteristic curves (COCCs) in Equation 20. A COCC equals its associated
OCC divided by $[1 - P_i(\theta)]$; hence, the COCCs for an item sum to 1 for all $\theta$
values. The OCCs for an item, in contrast, sum to $[1 - P_i(\theta)]$, which becomes
very small as $P_i(\theta)$ approaches 1. The weights for the COCCs are easier to
estimate than the weights for OCCs since the OCCs for easy items and for
rarely chosen options are close to 0, which causes the weights to become
indeterminate; COCCs are not usually close to 0. Because the OCC at $\theta = t$ is
equal to the COCC multiplied by $1 - P_i(t)$, the OCCs are available after the
COCCs have been obtained. The COCCs are intrinsically interesting as well as
mathematically tractable since their shapes can be used to study the
properties of effective distractors.
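The relation between the two kinds of curves is simple enough to show in a few lines. In this sketch the function name `occs_from_coccs` and the numerical values are hypothetical.

```python
def occs_from_coccs(p_correct, cocc_values):
    """OCC values at one ability level: each COCC value times (1 - P_i)."""
    return [c * (1.0 - p_correct) for c in cocc_values]


p = 0.9                         # hypothetical P_i(t) for an easy item
coccs = [0.5, 0.3, 0.15, 0.05]  # hypothetical COCC values; they sum to 1
occs = occs_from_coccs(p, coccs)

print(occs)       # every OCC is small because the item is easy
print(sum(occs))  # the OCCs sum to 1 - p
```

The example shows why the OCC weights are hard to pin down for easy items: all four OCC values are at most 0.05 even though the COCC values are well separated.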

The  quadratic  programming  methods  used  by  Levine  and  Williams  (1985)  are 
convenient  because  they  allow  plausible  constraints  to  be  placed  on  the  COCCs. 
One constraint is positivity: COCCs are not allowed to become negative. In
the present analyses all COCCs were required to equal or exceed .001. A
second constraint placed on COCCs is smoothness: The COCCs were not allowed
to oscillate widely. The smoothness constraint was implemented by restricting
the third derivative of the COCCs to be less than .005. This condition can be
thought  of  as  requiring  each  small  piece  of  the  graph  of  the  COCC  to  have  a 
very  accurate  quadratic  approximation.  (A  restriction  on  the  second 
derivative  would  force  the  COCC  to  be  locally  linear,  and  a  first  derivative 
constraint  would  force  the  COCC  to  be  locally  constant.) 
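A rough way to see what these constraints exclude is to test candidate curves on a grid. The sketch below is only an after-the-fact check, not the quadratic programming formulation; it assumes the smoothness bound applies to the absolute value of the third derivative, uses a finite-difference approximation, and treats the default `bound` and the two example curves as illustrative values.

```python
import math


def third_derivative(f, t, h=0.01):
    """Central-difference estimate of the third derivative of f at t."""
    return (f(t + 2 * h) - 2 * f(t + h) + 2 * f(t - h) - f(t - 2 * h)) / (2 * h ** 3)


def satisfies_constraints(cocc, grid, floor=0.001, bound=0.005):
    """Positivity (cocc >= floor) and smoothness (|third derivative| < bound)."""
    positive = all(cocc(t) >= floor for t in grid)
    smooth = all(abs(third_derivative(cocc, t)) < bound for t in grid)
    return positive and smooth


grid = [-4.0 + 0.1 * m for m in range(81)]


def gentle(t):
    """A slowly varying hypothetical COCC."""
    return 0.3 + 0.2 * math.tanh(0.2 * t)


def wiggly(t):
    """A rapidly oscillating curve of the kind the constraint rules out."""
    return 0.3 + 0.2 * math.sin(3.0 * t)


print(satisfies_constraints(gentle, grid))
print(satisfies_constraints(wiggly, grid))
```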

Estimation  and  Information 

Data  set.  The  data  set  used  in  our  analyses  was  a  spaced  sample  of  2,978 
examinees taken from the National Opinion Research Center (NORC; Bock &
Mislevy, 1981) sample of American youths. These examinees answered the
30-item ASVAB Arithmetic Reasoning (AR) subtest. Each item on this test has four
options.

ICC  estimation.  The  first  step  in  the  MFS  analysis  was  to  estimate 
ICCs  from  the  dichotomously  scored  item  responses.  To  this  end,  the  item 
responses  of  the  examinees  described  above  were  scored  dichotomously.  All 
nonanswered items were scored as incorrect (since we treated omits as a
separate, and incorrect, response option). Then version 2B of LOGIST (Wood
et al., 1976) was used to estimate item and ability parameters. Estimates of
item  discrimination  parameters  ranged  from  about  0.5  to  2.0,  and  estimates  of 
item  difficulties  varied  from  about  -3.0  to  1.4  (mean  =  .14,  SD  =  .99). 

Density estimation.  The ability density $f$ shown in Equation 18 was
estimated by the nonparametric method developed by Levine and Williams (1985).
The  density  was  represented  as  a  linear  combination  of  basis  functions,  and 
the  weights  were  estimated  by  maximum  likelihood.  The  weight  vectors  were 
restricted  to  a  convex  set  determined  by  hypotheses  about  the  shape  of  the 
unknown  density.  After  experimenting  with  various  shape  hypotheses,  the 
following  conditions  were  selected.  The  density  was  constrained  to  be 
nonnegative;  to  have  a  nonnegative  second  derivative  between  -4.8  and  -3.1;  to 
have  a  nonpositive  second  derivative  for  abilities  between  -.3  and  1.0;  to  be 
monotonically  increasing  for  abilities  between  -3.1  and  -.3;  and  to  be 
monotonically  decreasing  for  abilities  between  1.0  and  3.5.  These  conditions 
imply that the density will be unimodal between -3.1 and 3.5, that the mode
will  occur  between  -.3  and  1.0,  and  that  the  density  will  either  decrease  to  a 
lower  asymptote  as  ability  decreases  to  -5  or  will  have  a  second  mode  in  the 
left  tail  if  such  is  indicated  by  the  data.  It  was  decided  to  allow  a  second 
maximum  at  very  low  abilities  because  the  data  seemed  substantially  better  fit 
when  bimodality  was  permitted.  A  substantive  interpretation  of  bimodality  is 
noted  below. 

After  some  preliminary  analyses,  we  decided  to  remove  examinees  who 
answered fewer than half of the items. There were 87 such examinees, leaving
2,891  examinees  for  the  density  and  COCC  estimation. 

Figure  4  shows  the  obtained  density.  It  can  be  seen  that  the  density  is 
roughly  bell-shaped,  with  a  mode  near  0.  The  left  tail  turns  up  at  low 
abilities,  suggesting  a  relatively  large  number  of  examinees  with  very  low 
abilities.  One  substantive  interpretation  of  this  fat  left  tail  is  that  even 
among  examinees  who  answered  more  than  half  of  the  items  there  may  have  been 
some  who  were  poorly  motivated  and  did  not  make  a  serious  attempt  to  pass  the 
examination.  In  fact,  examinees  were  paid  to  take  the  examination  and 
consequently some of them may not have been adequately motivated. The test
information function at $\theta = -5$ is very low; consequently, bimodality cannot be
established unequivocally without much larger samples.

COCC  estimation.  Four  COCCs  were  estimated  for  each  item:  the  three 
incorrect  response  curves  and  an  omit  curve.  Omits  included  both  skipped 
responses  and  not-reached  responses.  The  number  of  orthonormal  basis 
functions  used  in  the  analysis  was  10.  Thus,  30  weights  (10  weights  for  each 
of  three  COCCs)  were  estimated  for  each  item.  The  weights  for  the  fourth  COCC 
were  a  known  linear  combination  of  the  weights  for  the  other  three  (Levine, 
1985b). 

Appendix  A  contains  plots  of  the  COCCs  estimated  for  all  30  AR  items. 

The  solid  curves  indicate  the  estimated  COCCs.  Each  page  in  Appendix  A 
contains  the  four  COCCs  for  two  items.  For  example,  the  first  page  of 
Appendix  A  has  the  COCCs  for  item  1  plotted  in  the  four  panels  to  the  left; 
the  four  panels  to  the  right  contain  COCCs  for  Item  2  of  the  AR  subtest.  For 
each  item,  the  top  left  panel  contains  the  COCC  for  the  first  incorrect 
option;  the  top  right  panel,  the  COCC  for  the  second  incorrect  option;  the 
bottom  left  panel,  the  COCC  for  the  third  incorrect  option;  and  the  bottom 
right  panel,  the  omit  COCC. 

The goodness of fit of the estimated COCCs can be evaluated by examining
the  vertical  lines  displayed  in  each  panel.  These  lines  were  obtained  by 
computing  three-parameter  logistic  ability  estimates  for  all  11,914  examinees 
in  the  NORC  data  set,  forming  25  ability  strata  on  the  basis  of  estimated 
abilities  by  using  the  4th,  8th,  ...,  96th  percentiles  of  the  standard  normal 
distribution  as  cutting  scores,  and  then  computing,  from  among  the  subset  of 
examinees  who  answered  the  item  incorrectly,  the  proportion  of  examinees 
selecting  each  option.  The  centers  of  the  vertical  lines  correspond  to  the 
observed  proportions  and  they  are  plotted  above  the  category  medians  (the  2nd, 
6th,  ...,  98th  percentiles  of  the  standard  normal  distribution).  The  vertical 
lines  represent  approximate  95%  confidence  intervals  for  the  observed 
proportions (± two standard errors, where the observed proportion is used to
compute  the  standard  error).  Observed  proportions  of  0  and  1  are  plotted  as 
plus  signs  and  are  offset  slightly  from  their  true  locations  so  that  they  will 
be  visible. 
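The construction of the strata and the intervals can be sketched as follows. This is a minimal illustration, assuming that ability estimates and option choices are available as `(theta_hat, chosen_option)` pairs for the examinees who answered the item incorrectly; the function names and the tiny data set are hypothetical.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

# Cut scores at the 4th, 8th, ..., 96th percentiles define 25 strata.
cuts = [std_normal.inv_cdf(p / 100) for p in range(4, 100, 4)]
# Plotting positions: category medians at the 2nd, 6th, ..., 98th percentiles.
medians = [std_normal.inv_cdf(p / 100) for p in range(2, 100, 4)]


def stratum(theta_hat):
    """Index (0 to 24) of the ability stratum that contains theta_hat."""
    return sum(theta_hat > c for c in cuts)


def option_intervals(records, option):
    """Per stratum: (proportion, lower, upper) using +/- two standard errors.

    records holds (theta_hat, chosen_option) pairs for examinees who
    answered the item incorrectly. Empty strata yield None.
    """
    counts = [0] * 25
    totals = [0] * 25
    for theta_hat, chosen in records:
        s = stratum(theta_hat)
        totals[s] += 1
        counts[s] += (chosen == option)
    result = []
    for c, n in zip(counts, totals):
        if n == 0:
            result.append(None)
            continue
        p = c / n
        se = math.sqrt(p * (1.0 - p) / n)
        result.append((p, p - 2.0 * se, p + 2.0 * se))
    return result


records = [(0.0, 2), (0.01, 3), (0.02, 2), (0.03, 2)]  # hypothetical data
intervals = option_intervals(records, option=2)
print(stratum(0.0), intervals[stratum(0.0)])
```

When an observed proportion is 0 or 1 the standard error computed this way is 0, which is consistent with the text plotting those cases as plus signs rather than as intervals.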

The AR items seem to be more-or-less ordered by difficulty.
Consequently, the 95% confidence intervals for the first few items in Appendix
A  are  very  wide  because  these  items  are  easy  and  so  few  examinees  chose 
incorrect  options.  Confidence  intervals  for  later  items  are  much  narrower  and 
provide  a  severe  test  for  COCC  estimates.  Item  27,  for  example,  shows  that 
the  COCC  estimates  provide  a  very  good  description  of  option  choice.  Notice 
that  the  COCC  for  the  omit  category  lies  below  most  observed  proportions. 

This  occurs  because  examinees  with  high  omitting  rates  were  excluded  from  the 
sample  used  to  estimate  COCCs,  but  were  included  in  the  total  sample  used  to 
compute  the  proportions  displayed  in  Appendix  A. 

COCC  estimation  verification.  The  figures  presented  in  Appendix  A  show 
that  MFS  estimates  of  COCCs  closely  follow  the  actual  patterns  of  item 
responses.  It  is  difficult,  however,  to  understand  the  accuracy  of  COCC 
estimates  from  these  figures  because  the  true  COCCs  are  not  known.  To  gain 
further  insights  into  the  properties  of  MFS  estimates  of  COCCs,  a  simulation 
data set of 3,000 response patterns was generated. Simulated abilities were
sampled from the standard normal distribution, probabilities of correct and
incorrect responses were determined from the ICCs obtained by the LOGIST run
described previously, and probabilities of option selections (for responses
simulated to be incorrect) were computed using the MFS-estimated COCCs.
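The generating process described here can be sketched in a few lines of Python. The item parameters, the stand-in `coccs` function, the seed, and the scaling constant `D = 1.7` are assumptions made for illustration; the study used the LOGIST ICC estimates and the MFS-estimated COCCs for the 30 AR items.

```python
import math
import random


def icc_3pl(t, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (t - b)))


def coccs(item, theta):
    """Stand-in COCCs for options 2-5 (three distractors and omit)."""
    w = [math.exp(-0.5 * theta), 1.0, math.exp(0.5 * theta), 0.3]
    s = sum(w)
    return [x / s for x in w]


def simulate_pattern(rng, items):
    """Draw one ability from N(0, 1) and one polychotomous response pattern."""
    theta = rng.gauss(0.0, 1.0)
    pattern = []
    for i, prm in enumerate(items):
        if rng.random() < icc_3pl(theta, *prm):
            pattern.append(1)  # option 1 is the (recoded) correct option
        else:
            # Incorrect: choose among options 2-5 with the COCC probabilities.
            pattern.append(rng.choices([2, 3, 4, 5], weights=coccs(i, theta))[0])
    return theta, pattern


items = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (0.8, 1.0, 0.25)]  # hypothetical
rng = random.Random(12345)
sample = [simulate_pattern(rng, items) for _ in range(3000)]
print(sample[0])
```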

COCCs  were  re-estimated  from  the  simulation  data  set.  The  true  ability 
density  (the  standard  normal)  was  used  in  Equation  18,  and  the  true  ICC  values 
were  used  to  compute  probabilities  of  correct  and  incorrect  responses.  The 
true  ability  density  and  ICC  values  were  used  because  we  wanted  to  determine 
the  errors  of  COCC  estimates  in  a  way  that  was  not  confounded  with 
inaccuracies  in  density  estimates  and  ICC  estimates. 

The  results  of  the  simulation  study  are  shown  in  Appendix  B,  which 
presents  the  re-estimated  COCCs  for  all  30  items.  Heavy  lines  indicate  the 
re-estimated  COCCs  and  thin  lines  indicate  the  true  COCCs.  Observed 
proportions and their approximate 95% confidence intervals are shown for the
simulation sample of N = 3,000. The observed proportions were not plotted if
five  or  fewer  incorrect  responses  were  made  in  an  ability  stratum. 

Item  2  shows  estimated  COCCs  that  are  very  close  to  the  true  COCCs  for 
all  ability  levels.  This  is  remarkable  because  there  were  almost  no  incorrect 
responses made by simulated examinees with above-average ability. Item 3
shows that we cannot always expect to have well-estimated COCCs when there are
no data available: Large differences between true and estimated COCCs occur at
high  ability  levels.  The  COCCs  were,  however,  accurately  estimated  in  ability 
ranges  for  which  there  were  more  than  a  handful  of  incorrect  responses. 

From  an  inspection  of  the  plots  in  Appendix  B,  it  seems  evident  that  COCC 
values  were  accurately  estimated  when  there  were  six  or  more  incorrect 
responses  in  adjacent  ability  strata.  Sometimes  COCC  values  were  well- 
estimated  when  fewer  incorrect  responses  were  available,  but  this  seemed  to  be 
a  matter  of  chance.  Notice,  also,  that  COCCs  for  the  omit  option  were  not 
underestimated  in  this  analysis  as  they  were  in  the  analysis  of  the  real  AR 
data.  In  this  analysis,  all  response  vectors  were  used;  there  was  no 
restriction  on  omitting  as  in  the  previous  analysis. 

Information  functions.  Information  functions  for  the  dichotomous  and 
polychotomous  modelings  of  the  AR  test  are  shown  in  Figure  5.  An  expression 
for the information function of the three-parameter logistic model is

The correct option makes the same contribution to information for
dichotomous and polychotomous scorings; namely, the first term on the
right-hand sides of Equations 21 and 22. Thus, any differences in information are
due to the treatment of incorrect responses. Although it is not obvious
from Equations 21 and 22, it can be shown that the information function
for the polychotomous model equals or exceeds the three-parameter logistic
model's information function. Thus, any increase in information is
due to polychotomous scoring.


Figure  5  shows  that  there  are  moderate  gains  in  information  due  to 
polychotomous  scoring  of  the  AR  items  for  low  to  moderately  high  abilities. 
These  gains  are  equivalent  to  adding  about  5  or  6  items  to  the  subtest. 

Little  or  no  information  is  gained  for  high  ability  examinees.  This  latter 
finding  is  not  surprising  because  high  ability  examinees  are  expected  to 
answer  nearly  all  the  items  correctly. 

It  should  be  noted  that  the  AR  items  were  not  written  with  polychotomous 
scoring  in  mind,  and  so  the  gains  in  information  shown  in  Figure  5  are  more- 
or-less  fortuitous.  Larger  gains  might  be  realized  if  item  writers  knew  the 
attributes  of  incorrect  options  that  typically  lead  to  substantial  increases 
in  information. 

Appropriateness  Measurement  for  the  AR  Subtest 


Purpose 

This section compares the effectiveness of dichotomous and
polychotomous models for detecting aberrant response patterns. By comparing
detection  rates  of  optimal  indices,  it  is  possible  to  compare  the  maximum 
detection  rates  possible  for  a  given  form  of  aberrance.  As  in  the  previous 
section,  the  dichotomous  model  is  a  submodel  of  the  polychotomous  model; 
hence,  any  increase  in  detection  rates  is  due  to  modeling  incorrect  responses. 

Several  practical  indices  were  also  evaluated.  Most  of  these  indices  are 
computed  from  the  dichotomously  scored  item  responses.  One  index,  however,  is 
the  natural  extension  of  a  dichotomous  model  index  to  the  polychotomous  case. 
Detection  rates  for  the  practical  indices  will  indicate  (a)  which  are 
relatively  more  powerful  and  less  powerful,  and  (b)  the  extent  to  which  the 
maximum  detection  rates  are  attained. 

Overview 


The ICCs and OCCs estimated for the AR subtest from the sample of
N = 2,891 were used as the "true" item parameters in a simulation study.
Initially, a sample of N = 3,000 simulated response patterns was created and
used as a test norming sample. This data set was used to determine the item
and test statistics required to compute all but two ($z_p$ and DFK) of the
practical appropriateness indices listed in the next section. Then a normal
sample (appropriate responding) of N = 4,000 response vectors was created. In
addition, 16 aberrant samples of N = 2,000 were generated to simulate several
forms of aberrance. Optimal indices and all the practical indices were then
computed for the normal sample and aberrant samples. Rates of detection of
aberrant response vectors at various false alarm rates were determined for
each appropriateness index and each form of aberrance.
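The last step, reading a detection rate off the normal and aberrant samples at a fixed false alarm rate, can be sketched as follows. This is a minimal illustration that assumes larger index values indicate more aberrant responding (an index scored in the opposite direction would be negated first); the function name and the toy scores are hypothetical.

```python
def detection_rate(normal_scores, aberrant_scores, false_alarm_rate):
    """Proportion of aberrant scores above the cutoff that flags at most
    the stated proportion of the normal sample."""
    ranked = sorted(normal_scores, reverse=True)
    # Number of normal vectors allowed above the cutoff (guarding rounding).
    k = int(false_alarm_rate * len(ranked) + 1e-9)
    cutoff = ranked[k]
    flagged = sum(score > cutoff for score in aberrant_scores)
    return flagged / len(aberrant_scores)


normal = list(range(100))           # toy index values for a normal sample
aberrant = [99.5, 98.5, 50.0, 10.0]  # toy index values for an aberrant sample
print(detection_rate(normal, aberrant, 0.02))
```

Repeating the calculation over a set of false alarm rates gives the selected points of the ROC curve for one index and one aberrant sample.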


Appropriateness  Indices 

This  section  lists  the  appropriateness  indices  that  are  evaluated. 
Technical details about the indices are given in Chapter II.

Polychotomous model optimal index ($LR_p$).  Denote the polychotomously
scored response vector by $\mathbf{v}$. The polychotomous model optimal index studied
here is

where the probabilities are computed using three-parameter logistic ICCs to
determine  conditional  probabilities  of  correct  responses  and  MFS  OCCs  to 
determine  conditional  probabilities  of  incorrect  responses. 

Dichotomous model optimal index ($LR_3$).  This index is identical to $LR_p$
except that only the pattern of correct and incorrect responses $\mathbf{u}$ is used in
its calculation. This class of indices, therefore, provides the highest rate
of detection when the choice of incorrect option is ignored.

Dichotomous model optimal index computed using estimated item parameters
($\widehat{LR}_3$).  For optimal indices to be truly optimal, they must be computed using
item parameters, not item parameter estimates. In previous work (Levine &
Drasgow,  1982),  we  found  that  the  values  of  some  appropriateness  indices  were 
almost  unaffected  when  item  parameter  estimates  were  used  in  place  of  item 
parameters.  In  the  present  research,  we  also  computed  optimal  indices  for  the 
three-parameter  logistic  model  using  estimated  item  parameters. 


Dichotomous and polychotomous model standardized indices ($z_3$ and $z_p$).  In
Chapter II, $z_3$ was discussed; $z_p$ is the generalization of $z_3$ to the case of a
polychotomous analysis of the item responses.

Fit statistics (F1 and F2).  (Discussed in Chapter II.)

Caution indices (S, T2, and T4).  (Discussed in Chapter II.)

Item-option variance (IOV).  (Discussed in Chapter II.)

Likelihood function curvature statistics (JK and O/E).  (Discussed in
Chapter II.)


Deliberate failure key (DFK).  The final index evaluated is the
DFK developed by the Navy Personnel Research and Development Center (Swanson &
Foley, 1982) to detect individuals who are deliberately attempting to obtain low
scores. Although DFK was developed for the AFQT composite, we used the key
for the AR subtest only.

Method


Data  Sets.  A  test  norming  sample  of  3,000  response  vectors  was  created 
by sampling 3,000 numbers ($\theta$s) from the normal (0,1) distribution truncated to
the  [-5.0,  3.5]  interval.  A  normal  sample  of  4,000  response  vectors  was  also 
generated  in  this  way.  Then  2,000  aberrant  response  vectors  were  created  in 
each  of  16  conditions.  These  conditions  resulted  from  varying  three  factors: 
the  type  of  aberrance  (spuriously  high;  spuriously  low),  the  severity  of 
aberrance  (mild;  moderate),  and  the  distribution  from  which  simulated 
abilities  were  sampled. 

Eight  of  the  aberrant  samples  contained  spuriously  high  response  vectors, 
and  the  remaining  eight  samples  contained  spuriously  low  response  vectors. 
Spuriously  high  response  patterns  were  created  by  first  generating  normal 
response  vectors  (using  the  AR  three-parameter  logistic  ICCs  to  determine  the 
probabilities  of  correct  responses,  and  the  AR  COCCs  to  determine  the 
probabilities  of  incorrect  option  selection)  and  then  replacing  either  17% 
(mild  aberrance)  or  33%  (moderate  aberrance)  of  the  simulated  responses 
(randomly  sampled  without  replacement)  with  correct  responses.  Spuriously  low 
response  patterns  were  also  created  by  first  generating  normal  response 
vectors.  Then  17%  or  33%  of  the  items  were  randomly  selected  without 
replacement  and  the  responses  to  these  items  replaced  with  random  responses 
(i.e.,  a  response  was  replaced  by  option  A  with  probability  .25,  by  option  B 

with  probability  .25,  ...,  and  by  option  D  with  probability  .25). 
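The two manipulations can be sketched as follows. This is a minimal illustration, assuming responses are coded 1 to 5 with 1 as the (recoded) correct option and 5 as an omit; the function names and the seed are hypothetical. With 30 items, rounding 17% and 33% gives 5 and 10 altered items.

```python
import random


def make_spuriously_high(responses, fraction, rng, correct=1):
    """Replace a random subset of responses with the correct option."""
    out = list(responses)
    n_changed = round(fraction * len(out))
    for i in rng.sample(range(len(out)), n_changed):
        out[i] = correct
    return out


def make_spuriously_low(responses, fraction, rng, options=(1, 2, 3, 4)):
    """Replace a random subset of responses with equiprobable random options."""
    out = list(responses)
    n_changed = round(fraction * len(out))
    for i in rng.sample(range(len(out)), n_changed):
        out[i] = rng.choice(options)
    return out


rng = random.Random(7)
normal_vector = [rng.choice((1, 2, 3, 4, 5)) for _ in range(30)]  # placeholder
mild_high = make_spuriously_high(normal_vector, 0.17, rng)
moderate_low = make_spuriously_low(normal_vector, 0.33, rng)
print(mild_high)
print(moderate_low)
```

Note that a replaced response can coincide with the original one, so the number of responses that actually change may be smaller than the number of items selected.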

The  third  variable  manipulated  was  the  ability  level  of  the  aberrant 
sample.  Abilities  for  the  spuriously  high  samples  were  sampled  from  four 
parts  of  the  normal  (0,1)  distribution  truncated  to  [-5.0,  3.5]:  very  low 
(0th  through  9th  percentiles),  low  (10th  through  30th  percentiles),  low 
average  (31st  through  48th  percentiles),  and  high  average  (49th  to  64th 

percentiles).  In  all  cases,  percentiles  were  determined  after  the  truncation. 

Abilities  were  sampled  from  four  average  to  high  ability  strata  for  the 
spuriously  low  samples:  low  average  (31st  to  48th  percentiles),  high  average 
(49th  through  64th  percentiles),  high  (65th  through  92nd  percentiles),  and 
very  high  (93rd  percentile  and  above). 
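Sampling abilities from a percentile band of the truncated distribution can be done by inverting the normal distribution function. The sketch below is one way to do this, not necessarily the procedure used in the study; the function name is hypothetical, and the band is taken as the interval between the two stated percentiles, which is an assumption about how the band boundaries were handled.

```python
import random
from statistics import NormalDist

nd = NormalDist()
LO, HI = -5.0, 3.5
F_LO, F_HI = nd.cdf(LO), nd.cdf(HI)


def sample_ability(rng, pct_lo, pct_hi):
    """Draw theta from N(0, 1) truncated to [LO, HI], restricted to the band
    between two percentiles computed after truncation."""
    q = rng.uniform(pct_lo / 100, pct_hi / 100)
    # Map the truncated-distribution percentile back to the untruncated scale.
    return nd.inv_cdf(F_LO + q * (F_HI - F_LO))


rng = random.Random(2024)
very_low = [sample_ability(rng, 0, 9) for _ in range(1000)]
very_high = [sample_ability(rng, 93, 100) for _ in range(1000)]
print(min(very_low), max(very_low))
print(min(very_high), max(very_high))
```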

Analysis.  The  analysis  followed  the  procedure  described  in  Chapter  II. 
All  the  item  and  test  statistics  required  to  compute  the  practical 
appropriateness indices were computed using the test norming sample. LOGIST
(Wood et al., 1976) was used to estimate three-parameter logistic item
parameters  and  a  Fortran  program  was  written  to  compute  the  other  quantities 
required. 


The practical appropriateness indices and $\widehat{LR}_3$ were then computed for
the response vectors in the normal and aberrant samples. Optimal indices were
also computed for the normal sample for four aberrant conditions: 17%
spuriously high, 33% spuriously high, 17% spuriously low, and 33% spuriously
low. The 17% spuriously high optimal index was computed for the four samples
with this form of aberrance, the 33% spuriously high optimal index was
computed for the four samples with this form of aberrance, etc. The ICCs and
COCCs used to generate the data were used to compute $LR_p$ and $LR_3$.

Results


The results for the spuriously high conditions are given in Tables 10
through 13. The results for the lowest ability group are shown in Table 10.
In this table, it is evident that cheating on five randomly selected items was
not very detectable: At a 2% false alarm rate, only 28% of the simulated
cheaters were detected by the optimal $LR_p$ index. The best of the practical
indices, $z_3$ and F2, detected 18% and 20%, respectively. (The higher detection
rate of IOV resulted because this index is poorly standardized; see Chapter
II.) Cheating on 10 items (the 33% condition) was reasonably detectable. For
example, $LR_p$ detected 61% and $LR_3$ detected 54% at a 2% false alarm rate. At
this false alarm rate, $z_3$, F2, and T4 detected 44%, 41%, and 50%,
respectively.

The detection rates of the optimal indices showed a relatively small
decline from Table 10 to Table 11. At a 2% false alarm rate, $LR_p$, for
example, declined from 28% to 26% for the 17% spuriously high treatment and
from 61% to 53% for the 33% treatment. Most of the practical indices showed
larger declines in detection rates. This trend continues in Table 12.

Finally, in Table 13, it is evident that simulated cheating on the AR
subtest was almost undetectable for high average examinees. In contrast,
Drasgow et al. (1985) found moderate detection rates for simulated cheaters
with comparable abilities for the SAT-V. A significant difference between the
two tests lies in the frequency (and relative frequency) of difficult ($b_i > 1.0$),
discriminating ($a_i > 1.0$) items with low lower asymptotes ($c_i \le .10$).
Seventeen of the 85 SAT-V items satisfied these three conditions. In
contrast, none of the 30 AR items met these conditions and only three items
had $b_i > 1.0$. In sum, high average examinees had a reasonably good chance of
responding correctly to each AR item; so, correct responses obtained by
cheating were not clearly aberrant.

The results for the spuriously low samples are given in Tables 14 through
17. In Table 14, it is evident that 33% spuriously low responding by
simulated low average examinees was moderately detectable by $LR_p$ (a 30%
detection rate with 2% false alarms) but not by any of the other
appropriateness indices. Higher detection rates were obtained for simulated
high average examinees (shown in Table 15). Again, $LR_p$ performed
substantially better than any other index. High rates of detection of
simulated high and very high examinees are shown in Tables 16 and 17. $LR_p$ was
clearly the best index, with detection rates of 72% and 81% for a 2% false
alarm rate in the 33% spuriously low treatment.




Table 10. Selected ROC Points for Spuriously High
Response Patterns Generated from the 0-9% Ability Range


Table 12. Selected ROC Points for Spuriously High
Response Patterns Generated from the 31-48% Ability Range


Table 13. Selected ROC Points for Spuriously High
Response Patterns Generated from the 49-64% Ability Range

False                          Proportion detected by
alarm
rate   LRp  LR3  LR3(est)  zp   z3   F1   F2   S    T2   T4   IOV  JK   O/E  DFK

17% Spuriously High Treatment

.001   00   00   00        00   00   00   00   00   00   00   00   00   00   00
.005   00   00   01        00   01   00   00   00   02   03   00   00   00   00
.01    02   01   03        00   03   01   01   00   04   04   00   00   00   00
.02    07   06   07        01   05   01   03   00   07   08   01   00   00   00
.03    11   09   11        01   08   04   04   00   ??   ??   02   00   06   00
.04    14   13   14        02   10   06   07   01   14   14   03   00   09   00
.05    18   16   17        03   13   08   08   01   16   ??   04   00   12   00
.07    25   23   24        06   17   11   13   03   20   2?   07   01   17   00
.10    33   30   34        09   23   16   19   05   26   27   11   07   24   03

33% Spuriously High Treatment

.001   01   02   01        00   00   00   00   00   00   02   00   00   00   00
.005   05   04   03        00   03   01   01   00   05   07   00   00   01   00
.01    08   10   11        00   04   02   04   00   07   10   00   00   02   00
.02    19   16   18        01   07   07   08   01   12   ??   01   00   06   00
.03    28   23   25        02   10   11   11   02   16   20   01   00   08   00
.04    34   32   32        03   12   14   15   03   20   25   03   00   11   00
.05    37   37   36        05   16   17   17   04   23   29   04   00   14   00
.07    48   45   46        08   19   23   23   07   28   35   05   03   20   00
.10    60   55   56        13   25   31   31   12   35   40   10   11   28   01

Note. A question mark stands for a digit that is illegible in the source.


Table 14. Selected ROC Points for Spuriously Low
Response Patterns Generated from the 31-48% Ability Range

False  Proportion detected by
alarm
rate   LR_p  LR_3 LR'_3   z_p   z_3    F1    F2     S    T2    T4   IOV    JK   O/E   DFK

17% Spuriously Low Treatment
.001     01    00    00    00    00    00    00    00    00    00    00    00    00    00
.005     05    01    01    03    02    00    01    00    02    02    01    00    00    00
.01      09    03    03    05    04    01    02    00    03    03    03    01    01    01
.02      15    06    07    08    07    02    04    00    06    07    06    01    02    01
.03      18    10    12    11    10    04    05    01    09    09    08    02    03    06
.04      21    14    15    11    13    07    07    03    12    12    12    03    05    06
.05      24    17    18    11    15    10    09    04    14    14    14    05    07    06
.07      29    22    23    ??    19    17    12    07    18    17    20    07    10    06
.10      35    28    28    27    26    25    17    11    23    22    26    12    14    20

33% Spuriously Low Treatment
.001     07    01    01    01    02    00    00    00    00    01    02    00    00    00
.005     14    03    04    07    05    00    04    00    03    04    04    01    01    01
.01      22    08    09    11    10    02    07    00    05    07    07    02    01    01
.02      30    14    16    18    15    05    11    03    09    11    13    04    03    01
.03      36    20    22    ??    20    09    13    06    14    15    18    07    04    16
.04      41    24    26    27    23    13    17    10    16    19    23    10    06    16
.05      45    29    30    ??    26    17    19    11    19    22    27    13    07    16
.07      51    36    37    ??    32    27    24    17    22    27    32    17    11    16
.10      59    44    44    44    38    36    31    25    29    33    41    24    16    37


Table 16. Selected ROC Points for Spuriously Low
Response Patterns Generated from the 65-92% Ability Range

False  Proportion detected by
alarm
rate   LR_p  LR_3 LR'_3   z_p   z_3    F1    F2     S    T2    T4   IOV    JK   O/E   DFK

17% Spuriously Low Treatment
.001     20    05    04    00    02    00    00    00    01    02    00    00    00    00
.005     30    17    16    05    07    03    02    00    08    07    00    00    00    00
.01      34    24    22    08    13    08    06    01    12    10    02    00    00    00
.02      41    30    30    14    20    20    12    04    19    17    04    00    00    00
.03      43    34    33    19    26    28    15    06    24    20    05    00    06    01
.04      46    38    36    23    28    32    20    10    28    24    08    00    09    02
.05      49    40    39    26    31    36    22    12    ??    27    10    00    12    02
.07      52    43    43    33    31    41    28    17    35    32    14    03    18    02
.10      56    49    49    43    45    46    38    22    43    38    20    10    24    09

33% Spuriously Low Treatment
.001     38    14    17    03    11    00    01    00    08    12    06    00    00    ??
.005     48    24    28    20    24    02    11    00    26    25    10    00    00    ??
.01      55    34    38    29    36    08    19    04    33    31    15    00    08    ??
.02      62    41    44    38    45    24    30    11    43    42    27    00    17    ??
.03      65    47    50    44    ??    36    37    15    ??    46    33    00    22    ??
.04      68    50    52    49    55    43    43    19    55    51    40    00    28    ??
.05      71    54    55    53    59    49    46    23    58    54    43    00    32    ??
.07      74    54    60    61    64    57    53    31    62    59    50    07    42    ??
.10      78    64    65    69    ??    64    61    41    69    65    58    22    50    ??



Table 17. Selected ROC Points for Spuriously Low
Response Patterns Generated from the 93-100% Ability Range

False  Proportion detected by
alarm
rate   LR_p  LR_3 LR'_3   z_p   z_3    F1    F2     S    T2    T4   IOV    JK   O/E   DFK

Discussion 


In  this  chapter,  we  described  Levine's  (1985a,  1985b)  theory  of 
psychological  measurement.  It  was  used  to  estimate  COCCs  for  a  sample  of 
2,891  examinees  who  responded  to  the  AR  subtest.  Good  to  excellent  fits  were 
obtained  when  the  estimated  COCCs  were  compared  to  empirical  proportions 
computed  from  the  responses  of  a  larger  sample  of  11,914  examinees.  A 
simulation  data  set  was  also  used  to  investigate  COCC  estimates.  Very 
accurate  estimates  were  obtained  for  ability  ranges  having  sufficient  numbers 
of  examinees  who  responded  incorrectly. 

The  test  information  function  of  the  polychotomous  model  was  found  to  be 
moderately  larger  than  the  three-parameter  logistic  information  function  for 
low to moderately high ability levels. Since there is information in
incorrect  options,  it  seems  prudent  to  use  it  if  items  are  expensive  to  write, 
if  the  number  of  items  that  can  be  administered  is  severely  limited,  or  if 
very  accurate  ability  estimates  are  required.  Furthermore,  we  can  now  study 
systematically  the  differences  between  items  with  informative  incorrect 
options  and  items  with  essentially  noninformative  incorrect  options.  It  may 
be  possible  to  identify  different  characteristics  of  these  two  types  of  items. 
Then  item  writers  could  explicitly  attempt  to  write  items  with  highly 
informative  incorrect  options  and  thus  increase  the  information  about  ability 
provided  by  tests. 

An  Appropriateness  Measurement  simulation  study  was  also  conducted  to 
compare  the  polychotomous  model  with  a  dichotomous  submodel;  namely,  the 
three-parameter  logistic.  Several  important  results  were  obtained.  First, 
for  the  spuriously  low  treatment  that  simulates  atypical  educations, 
misgridding  answers  to  a  portion  of  the  test,  unusual  creativity,  etc.,  we 
found  that  optimal  three-parameter  logistic  appropriateness  indices  fell  far 
short  of  their  optimal  polychotomous  model  counterparts.  At  some  false  alarm 
rates, the rates of detection of aberrant response vectors were more than 100%
higher  for  the  polychotomous  optimal  indices.  Thus,  Appropriateness 
Measurement  constitutes  one  important  practical  testing  problem  where 
substantial  gains  are  made  by  the  use  of  a  polychotomous  item  response  model. 


The results of the Appropriateness Measurement simulation study also
showed that the practical polychotomous model index z_p was not a particularly
good index: Its detection rates were not close to optimal for either
spuriously high or spuriously low treatments. This result, in conjunction
with the results described previously, points to the need to devise better
polychotomous appropriateness indices that can be used in practical
situations.


A third result obtained in the Appropriateness Measurement research
reported in this chapter was that the z_3, F2, and T4 indices effectively
detected aberrance in relation to three-parameter logistic optimal indices
(but not polychotomous model optimal indices). Therefore, if one is satisfied
with dichotomous scoring of item responses for some particular application,
then z_3, F2, and T4 can be used with confidence to detect inappropriate test
scores.


In  sum,  COCC  estimates  provide  opportunities  to  improve  testing  in  a 
variety  of  ways:  ability  estimation,  the  theory  and  practice  of  item  writing, 
and  Appropriateness  Measurement.  Applications  in  areas  such  as  the  evaluation 
of  item  and  test  bias  and  adaptive  testing  may  also  be  fruitful. 

Consequently, we conclude that there is information in incorrect responses and
that  polychotomous  item  response  models  can  make  important  contributions  to 
psychological  testing. 


IV.  MULTI-TEST  EXTENSIONS  OF  PRACTICAL  AND 
OPTIMAL  APPROPRIATENESS  INDICES 


Introduction

This  chapter  describes  methods  for  efficient  detection  of  inappropriate 
test  scores  in  situations  where  examinees  complete  several  short  tests.  In 
particular,  information  about  aberrance  is  pooled  across  tests  that  measure 
distinct  traits.  This  approach  seems  valuable  for  test  batteries  such  as  the 
ASVAB,  which  contains  a  number  of  short  power  subtests. 

Model-based approaches to the detection of aberrant response patterns
have generally assumed that the latent trait space is unidimensional. For
example, the three-parameter logistic model has been used by Levine and his
colleagues (Drasgow & Levine, 1986; Drasgow et al., 1985; Levine & Drasgow,
1982; Levine & Rubin, 1979). Tatsuoka (Harnisch & Tatsuoka, 1983; Tatsuoka,
1984) has used the two- and three-parameter logistic models for her extended
caution indices. Wright (1977) has tried to identify individuals who do not
conform to another unidimensional model; namely, the Rasch model.

In  Chapter  II,  we  found  that  appropriateness  indices  can  provide  very 
high  detection  rates  for  long  unidimensional  tests.  Detecting  aberrant 
response  patterns  on  shorter  tests  was  shown  to  be  a  much  more  difficult  task 
in  Chapter  III.  What  can  be  done  to  increase  detection  rates  on  short  tests? 
The solution does not lie in better appropriateness indices for unidimensional
tests, because no index computed from the item responses can provide higher
detection rates than the optimal index used in Chapter III. This fact led us
to devise methods for pooling information about aberrance across several
short, unidimensional tests.

Another  approach  to  detecting  aberrant  response  patterns  uses  external 
information  to  predict  test  scores.  The  standardized  residual  (i.e.,  the 
standardized  error  of  prediction)  can  then  be  used  as  an  appropriateness 
index.  For  example,  test  scores  not  included  in  a  selection  composite  can  be 
used  to  predict  the  composite  score.  Persons  who  cheated  on  the  tests 
included  in  the  composite,  but  not  on  the  other  tests,  would  be  expected  to 
have  large  positive  standardized  residuals  and  therefore  be  identifiable. 
Similarly,  scores  from  operational  sections  of  a  test  can  be  used  to  predict 
scores  on  an  experimental  section  in  order  to  identify  examinees  who  do  not 
make  a  serious  effort  on  the  experimental  section.  These  examinees  would  be 
expected to have large negative standardized residuals.

Little is known about the efficacy of the standardized residual approach
to the identification of aberrant response patterns. In the second study
described in this chapter, we evaluated this approach and compared it to
model-based methods of Appropriateness Measurement.
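
The standardized residual idea can be illustrated with a short sketch. It is only an illustration under stated assumptions: a single external predictor score, an ordinary least squares prediction equation, and hypothetical names (`standardized_residuals`, `x`, `y`). The prediction equation actually used is not specified at this point in the text.

```python
import math

def standardized_residuals(x, y):
    """Regress y on x by ordinary least squares and return each residual
    divided by the residual standard deviation (assumed index)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))
    return [r / s for r in resid]
```

Under this sketch, an examinee whose composite score is far above the value predicted from the other tests receives a large positive value, and one who did not try on an experimental section receives a large negative value.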


The  next  section  of  this  chapter  describes  multi-test  extensions  of  six 
practical  appropriateness  indices,  and  then  presents  one  means  of 
approximating  multi-test  optimal  indices.  The  approximation  and  multi-test 
practical  indices  were  evaluated  in  two  studies.  The  first  used  simulated 
ASVAB  data  so  that  all  assumptions  about  the  item  responses  (local 
independence,  three-parameter  logistic  item  characteristic  curves,  etc.)  were 
correct.  In  the  second  study,  an  actual  ASVAB  data  set  was  used  so  that  the 
performances  of  the  appropriateness  indices  could  be  evaluated  under  realistic 
conditions.

Multi-Test  Extensions  of  Practical  Appropriateness  Indices 

The basic assumption for our multi-test indices is that the test battery
consists of several unidimensional tests. Let $U_j = (U_{1j}, \ldots,
U_{n_j j})$ denote the random vector of item responses for test $j$,
$j = 1, \ldots, m$, let $u_j = (u_{1j}, \ldots, u_{n_j j})$ denote a value of
the random vector, and let $\theta = (\theta_1, \ldots, \theta_m)$ denote a
vector containing the abilities measured by each of the $m$ tests. Then

$$P(U_1, \ldots, U_m \mid \theta) = \prod_{j=1}^{m} P(U_j \mid \theta)
  = \prod_{j=1}^{m} P(U_j \mid \theta_j),$$

where both equalities result from local independence. This shows that the
random vectors $U_j$ are independent after conditioning on the individual
abilities $\theta_j$. Consequently,

$$P(f_1(U_1), \ldots, f_m(U_m) \mid \theta)
  = \prod_{j=1}^{m} P(f_j(U_j) \mid \theta_j), \qquad (23)$$

for arbitrary functions $f_j$ (see Chung, 1974, p. 51), which means that
functions of the item responses are also conditionally independent.

Standardized $\ell_0$. The significance of Equation 23 for developing
multi-test extensions of appropriateness indices will be illustrated with
the standardized $\ell_0$ indices. Let

$$\ell_0 = \log P(U_1 = u_1, \ldots, U_m = u_m \mid \theta)
  = \sum_{j=1}^{m} \ell_0^{(j)},$$

where

$$\ell_0^{(j)} = \log P(U_j = u_j \mid \theta_j).$$

Then

$$E(\ell_0) = \sum_{j=1}^{m} E[\ell_0^{(j)}]$$

and by Equation 23

$$\mathrm{Var}(\ell_0) = \sum_{j=1}^{m} \mathrm{Var}(\ell_0^{(j)}).$$

Hence, $\ell_0$ can be standardized by

$$z = \frac{\ell_0 - E(\ell_0)}{[\mathrm{Var}(\ell_0)]^{1/2}}. \qquad (24)$$

Expressions for $E(\ell_0^{(j)})$ and $\mathrm{Var}(\ell_0^{(j)})$ were given
by Drasgow et al. (1985) for dichotomously and polychotomously scored item
responses. We shall denote the standardized $\ell_0$ index by z_3 when the
three-parameter logistic model is used. The index is denoted z_p when it is
based on a polychotomous model.


In practice, the $\theta_j$ are not known. We have used maximum likelihood
estimates $\hat{\theta}_j$ in place of the $\theta_j$ in our past research
with apparent success (see Drasgow et al., 1985, Figures 3 and 4). Of course,
other approaches to estimation could be used. In fact, the well-known bias of
maximum likelihood estimates suggests that perhaps alternative estimation
methods should be explored.
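
As an illustration of Equation 24, the following sketch pools the log likelihood across tests for dichotomously scored items. It assumes the familiar dichotomous expressions for the conditional mean, sum of [P log P + Q log Q], and variance, sum of P Q [log(P/Q)]^2, of the log likelihood. The function names are hypothetical, and the correct-response probabilities `p` are supplied by the caller (in practice they would be evaluated at ability estimates).

```python
import math

def l0_moments(u, p):
    """Return (l0, E[l0], Var[l0]) for one test.
    u: list of 0/1 item scores; p: probabilities of a correct response."""
    l0 = sum(math.log(pi) if ui == 1 else math.log(1.0 - pi)
             for ui, pi in zip(u, p))
    mean = sum(pi * math.log(pi) + (1.0 - pi) * math.log(1.0 - pi) for pi in p)
    var = sum(pi * (1.0 - pi) * math.log(pi / (1.0 - pi)) ** 2 for pi in p)
    return l0, mean, var

def multi_test_z(tests):
    """Standardized log likelihood pooled over tests.
    tests: list of (u, p) pairs, one pair per unidimensional test."""
    total_l0 = total_mean = total_var = 0.0
    for u, p in tests:
        l0, mean, var = l0_moments(u, p)
        total_l0 += l0      # log likelihoods add across tests
        total_mean += mean  # so do their conditional means
        total_var += var    # and, by conditional independence, their variances
    return (total_l0 - total_mean) / math.sqrt(total_var)
```

Because numerator and variance both add across tests, pooling two tests with identical responses and probabilities multiplies the single-test value by the square root of 2, which is a convenient sanity check on an implementation.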


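Standardized extended caution indices. Let $T2^{(j)}$ and $T4^{(j)}$ denote
Tatsuoka's (1984) second and fourth extended caution indices computed for the
$j$th test. Tatsuoka found that $E(T2^{(j)} \mid \theta_j) = 0$ and provided
expressions for $E(T4^{(j)} \mid \theta_j)$ and the conditional variances of
$T2^{(j)}$ and $T4^{(j)}$. The standardized multi-test extensions of the two
appropriateness indices are then

$$T2 = \frac{\sum_{j=1}^{m} T2^{(j)}}
            {\left[\sum_{j=1}^{m} \mathrm{Var}(T2^{(j)} \mid \theta_j)\right]^{1/2}}
  \qquad (25)$$

and

$$T4 = \frac{\sum_{j=1}^{m} \left[T4^{(j)} - E(T4^{(j)} \mid \theta_j)\right]}
            {\left[\sum_{j=1}^{m} \mathrm{Var}(T4^{(j)} \mid \theta_j)\right]^{1/2}}.
  \qquad (26)$$

Again, it is necessary to substitute estimates for the $\theta_j$ in
Equations 25 and 26.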

Fit statistics. The squared standardized residual fit statistic
described by Wright (1977) involves an item-by-item standardization of the
dichotomously scored item responses. Let $u_{ij}$ equal 1 or 0 depending upon
whether the examinee's response to item $i$ on test $j$ is correct or
incorrect, let $P_{ij}(\theta_j)$ equal the probability of a correct response
to this item among examinees with ability $\theta_j$, and let
$Q_{ij}(\theta_j) = 1 - P_{ij}(\theta_j)$. Then a multi-test extension of
Wright's statistic is

$$F1 = \sum_{j=1}^{m} \sum_{i=1}^{n_j}
       \frac{[u_{ij} - P_{ij}(\theta_j)]^2}{P_{ij}(\theta_j)\, Q_{ij}(\theta_j)}.$$

The second fit statistic that we investigated was described by Rudner
(1983). In our notation, this statistic is

$$F2^{(j)} = R_j / V_j,$$

where

$$R_j = \sum_{i=1}^{n_j} [u_{ij} - P_{ij}(\theta_j)]^2$$

and

$$V_j = \sum_{i=1}^{n_j} P_{ij}(\theta_j)\, Q_{ij}(\theta_j).$$

An extension to the multi-test case is

$$F2 = \sum_{j=1}^{m} R_j \Big/ \sum_{j=1}^{m} V_j.$$
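
A minimal sketch of the two multi-test fit statistics follows. It assumes that F1 is the sum over all items and tests of the squared standardized residuals (any normalizing constant is omitted) and that F2 is the ratio of pooled squared residuals to pooled item variances. The function name and argument layout are hypothetical.

```python
def fit_statistics(tests):
    """Return (F1, F2) pooled over tests.
    tests: list of (u, p) pairs; u holds 0/1 item scores and p the
    model probabilities of a correct response for the same items."""
    f1 = 0.0
    r_total = 0.0
    v_total = 0.0
    for u, p in tests:
        for ui, pi in zip(u, p):
            resid_sq = (ui - pi) ** 2
            item_var = pi * (1.0 - pi)
            f1 += resid_sq / item_var   # item-by-item standardization
            r_total += resid_sq         # numerator of F2
            v_total += item_var         # denominator of F2
    return f1, r_total / v_total
```

Note that F1 weights each item equally after standardization, whereas F2 lets items with probabilities near .5 dominate, so the two can rank the same response pattern differently.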


Unidimensional Tests. Levine and Drasgow (1984) showed that the most
powerful appropriateness index for a given form of aberrance on a
unidimensional test is the likelihood ratio statistic LR given in Equation 2.
In our past research, we have evaluated the integrals in $P_{Normal}(u)$ and
$P_{Aberrant}(u)$ by Simpson's rule, and used about 20 values of $\theta$ to
give the likelihood ratio LR adequate accuracy. Although these numerical
integrations are not particularly burdensome for a modern computer,
generalizations to multi-test optimal indices would require excessive
computations to evaluate multidimensional integrals. For this reason, we are
led to seek a way to evaluate the integrals that will have a more convenient
multi-test generalization.

Under general conditions, it can be shown that likelihood functions
asymptotically (with the number $n$ of items) have the shape of normal
densities. Consequently, for long tests

$$\log P_{Normal}(u \mid \theta) \approx a\theta^2 + b\theta + c. \qquad (29)$$

Throughout this chapter, we shall assume that the ability distribution
$f(\theta)$ is the standard normal, whence $\log[f(\theta)]$ is a quadratic in
$\theta$. Therefore, both $\log[P_{Normal}(u \mid \theta) f(\theta)]$ and
$\log[P_{Aberrant}(u \mid \theta) f(\theta)]$ should be approximately
quadratic. The justification of this approximation lies in the high degree of
agreement in Equation 29 and the high rates of detection of aberrant response
patterns obtained in the present research. The computational details needed
to reproduce our algorithm and replicate our results follow.

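If

$$\log[P_{Normal}(u \mid \theta) f(\theta)] \approx a\theta^2 + b\theta + c
  \qquad (30)$$

for $a < 0$, then

$$\int P_{Normal}(u \mid \theta) f(\theta)\, d\theta
  \approx \int e^{a\theta^2 + b\theta + c}\, d\theta
  = e^{c} e^{b^2/2k^2} \int e^{-(\theta - b/k^2)^2 / [2(1/k^2)]}\, d\theta
  = \sqrt{\pi/(-a)}\; e^{c} e^{b^2/(-4a)},$$

where $k = \sqrt{-2a}$ and the last equality results from recognizing that the
integrand in the previous expression is proportional to a normal density.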
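
The closed form can be checked numerically. The sketch below compares it with a composite Simpson's rule for one arbitrary choice of coefficients. The coefficient values, integration limits, and function names are illustrative assumptions only.

```python
import math

def gaussian_integral(a, b, c):
    """Closed form of the integral over the real line of
    exp(a*t**2 + b*t + c), valid for a < 0."""
    return math.sqrt(math.pi / -a) * math.exp(c + b * b / (-4.0 * a))

def simpson(f, lo, hi, n=400):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

# Illustrative coefficients (not taken from the report).
a, b, c = -0.9, 0.4, -1.2
numeric = simpson(lambda t: math.exp(a * t * t + b * t + c), -10.0, 10.0)
closed = gaussian_integral(a, b, c)
```

The two values agree to many decimal places here because the integrand is exactly the exponential of a quadratic; with a real likelihood the agreement depends on how well the quadratic fits near the maximum.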

In order for this approximation to be accurate, the quadratic must fit
well near the maximum of $v(\theta) = \log[P_{Normal}(u \mid \theta)
f(\theta)]$. We used the following iterative procedure to obtain the
quadratic. It begins by evaluating $v$ at five points: $\theta^0$ = the
maximum likelihood estimate $\hat{\theta}$ of $\theta$; $\theta^0 \pm .3$; and
$\theta^0 \pm .6$. Then a diagonal weight matrix is created with non-zero
elements $\exp(v(\theta) - v(\theta^0))$ corresponding to the five $\theta$
values. These weights are restricted to the interval [0.00001, 10.0] for
computational reasons. Then the method of weighted least squares is used to
obtain the initial coefficients $(a^0, b^0, c^0)$ of the quadratic.

The maximum of the fitted quadratic is $\theta' = -b^0/2a^0$. If $\theta'$ is
within .15 of $\theta^0$, the iterative procedure ends; otherwise, five new
$\theta$ values are selected as $\theta'$,

$$\theta' \pm \sqrt{(a^0)^{-1} \log(2/3)},$$

and

$$\theta' \pm \sqrt{(a^0)^{-1} \log(1/3)}.$$

Then the weights are recomputed, and weighted least squares is used to obtain
$(a^1, b^1, c^1)$. This process continues until
$|\theta^{\ell+1} - \theta^{\ell}| \le .15$. (Stricter convergence
requirements did not seem to improve the approximation in Equation 30.)


Two restrictions are imposed to ensure convergence:

i) $a^{\ell} \le -.01$;

and

ii) $|\theta^{\ell+1} - \theta^{\ell}| < 1.6/\ell$.

Convergence  is  usually  obtained  in  one  or  two  iterations. 
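
The iterative procedure can be sketched as follows. This is a simplified illustration, not the authors' FORTRAN implementation: the concavity restriction is applied by simply clamping the leading coefficient, the step-size restriction is omitted, and all names (`wls_quadratic`, `fit_log_posterior`, `max_iter`) are hypothetical.

```python
import math

def wls_quadratic(thetas, values, weights):
    """Weighted least-squares fit of values ~ a*t**2 + b*t + c."""
    # Augmented 3x4 normal equations (X'WX | X'Wy).
    m = [[0.0] * 4 for _ in range(3)]
    for t, y, w in zip(thetas, values, weights):
        row = (t * t, t, 1.0)
        for r in range(3):
            for s in range(3):
                m[r][s] += w * row[r] * row[s]
            m[r][3] += w * row[r] * y
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                for s in range(col, 4):
                    m[r][s] -= f * m[col][s]
    return tuple(m[r][3] / m[r][r] for r in range(3))

LOG_W_MIN = math.log(1e-5)
LOG_W_MAX = math.log(10.0)

def fit_log_posterior(v, theta0, tol=0.15, max_iter=20):
    """Fit a quadratic (a, b, c) to v(theta) near its maximum."""
    offsets = [-0.6, -0.3, 0.0, 0.3, 0.6]
    center = theta0
    for _ in range(max_iter):
        pts = [center + d for d in offsets]
        vals = [v(t) for t in pts]
        # Weights exp(v(theta) - v(center)), restricted to [1e-5, 10].
        wts = [math.exp(min(LOG_W_MAX, max(LOG_W_MIN, y - vals[2])))
               for y in vals]
        a, b, c = wls_quadratic(pts, vals, wts)
        a = min(a, -0.01)  # simplification: clamp to keep the fit concave
        new_center = -b / (2.0 * a)
        if abs(new_center - center) <= tol:
            return a, b, c
        # New points where exp(fitted quadratic) is 2/3 and 1/3 of its peak.
        d1 = math.sqrt(math.log(2.0 / 3.0) / a)
        d2 = math.sqrt(math.log(1.0 / 3.0) / a)
        offsets = [-d2, -d1, 0.0, d1, d2]
        center = new_center
    return a, b, c
```

When `v` is itself an exact quadratic, the first fit already recovers its coefficients, and the second pass merely confirms that the maximum has stopped moving.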

Plotted in Figure 6 are 98 of 100 pairs of likelihood ratios. The
abscissa values are the likelihood ratios that resulted from using Simpson's
rule to evaluate $P_{Normal}(u)$ and $P_{Aberrant}(u)$; the ordinate values
resulted from the quadratic approximations. The response patterns were
simulated normal examinees responding to a 30-item test, item characteristic
curves were three-parameter logistic ogives, ability was distributed as
standard normal, and the form of aberrance was 15% spuriously low. The two
pairs of points not plotted are (3.90, 3.91) and (5.07, 5.03).

In  Figure  6,  it  is  clear  that  the  quadratic  approximation  was  very 
accurate  for  likelihood  ratios  of  less  than  2.0.  It  was  somewhat  less 
accurate  for  larger  values.  In  a  variety  of  other  tests,  we  found  the 
approximation  to  be  accurate  for  other  aberrance  hypotheses,  for  both 
simulated  normal  and  simulated  aberrant  response  patterns. 

As a final check on the quadratic approximation, we determined hit rates
for the 33% spuriously low condition using the item parameters from Chapter
III. In this analysis, response vectors were generated from abilities in the
86 to 92 percentile range, and likelihoods were computed by Simpson's rule and
by the quadratic approximation method. The detection rates at several false
alarm rates are given below. LR_3 denotes the optimal index for the
dichotomously scored item responses (ICCs were three-parameter logistic
ogives), and LR_p denotes the optimal index for polychotomously scored item
responses. It is clear that the quadratic approximation is sufficiently
accurate for our purposes.


                         False Alarm Rate
Index   Method           .001   .01   .03   .05   .10

LR_p    Simpson           .53   .67   .75   .78   .84
LR_p    Quad. Approx.     .54   .66   .75   .79   .85
LR_3    Simpson           .31   .53   .63   .68   .75
LR_3    Quad. Approx.     .33   .51   .61   .66   .73

Two unidimensional tests. The likelihood that we must approximate is

$$P^* = \iint P(U_1 = u_1 \mid \theta_1)\, P(U_2 = u_2 \mid \theta_2)\,
        \phi_2(\theta; 0, \Sigma)\, d\theta, \qquad (31)$$

where $P(U_j = u_j \mid \theta_j)$ is the likelihood of $u_j$, $j = 1, 2$,
under either the normal or aberrant model, $\theta = (\theta_1, \theta_2)'$,
$0 = (0, 0)'$,

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

is the covariance matrix of the two traits, and $\phi_2$ is the bivariate
standard normal density,

$$\phi_2(\theta; 0, \Sigma) = (\det \Sigma)^{-1/2} (2\pi)^{-1}
  \exp\left[-\tfrac{1}{2}\, \theta' \Sigma^{-1} \theta\right].$$

The final expression for the approximation and its derivation are given
in Appendix C. The final expression depends only on the correlation $\rho$
between $\theta_1$ and $\theta_2$, which is assumed to be known, and the
coefficients $(a_1, b_1, c_1)$ and $(a_2, b_2, c_2)$ of the quadratic
approximations that can be fitted to the likelihood functions of the two tests
separately by the method described for a unidimensional test. Thus, we can
fit quadratics to each separately by the method previously described and then
easily compute the approximation to $P^*$.
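
The Appendix C expression is not reproduced here. As a hedged illustration only, the sketch below evaluates an integral of this kind under the assumption that each log likelihood is replaced by a quadratic $a_j\theta_j^2 + b_j\theta_j + c_j$ with $a_j \le 0$, using the standard multivariate Gaussian integral identity. It may differ in form from the report's own expression; the function names and the grid check are hypothetical.

```python
import math

def two_test_integral(q1, q2, rho):
    """Integral of exp(q1(t1) + q2(t2)) times the bivariate standard normal
    density with correlation rho, where q_j(t) = a_j*t**2 + b_j*t + c_j."""
    (a1, b1, c1), (a2, b2, c2) = q1, q2
    det_sigma = 1.0 - rho * rho
    # M = Sigma^{-1} - 2 * diag(a1, a2) is the precision of the integrand.
    m11 = 1.0 / det_sigma - 2.0 * a1
    m22 = 1.0 / det_sigma - 2.0 * a2
    m12 = -rho / det_sigma
    det_m = m11 * m22 - m12 * m12
    # Quadratic form b' M^{-1} b for the symmetric 2x2 matrix M.
    quad = (m22 * b1 * b1 - 2.0 * m12 * b1 * b2 + m11 * b2 * b2) / det_m
    return math.exp(c1 + c2 + 0.5 * quad) / math.sqrt(det_sigma * det_m)

def grid_integral(q1, q2, rho, half_width=8.0, step=0.05):
    """Brute-force midpoint-rule check of the same integral."""
    (a1, b1, c1), (a2, b2, c2) = q1, q2
    det_sigma = 1.0 - rho * rho
    norm = 1.0 / (2.0 * math.pi * math.sqrt(det_sigma))
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n):
        t1 = -half_width + (i + 0.5) * step
        for k in range(n):
            t2 = -half_width + (k + 0.5) * step
            dens = norm * math.exp(
                -(t1 * t1 - 2.0 * rho * t1 * t2 + t2 * t2) / (2.0 * det_sigma))
            total += dens * math.exp(
                a1 * t1 * t1 + b1 * t1 + c1 + a2 * t2 * t2 + b2 * t2 + c2)
    return total * step * step
```

With $\rho = 0$ the expression factors into the product of two one-dimensional integrals, which gives a second independent check besides the grid comparison.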

Study  One:  Simulated  ASVAB  Data 

Purpose. How effective are the practical multi-test appropriateness
indices relative to optimal multi-test appropriateness indices? What are the
upper limits on the detectabilities of certain benchmark forms of aberrance
when information from several short tests is combined?

In order for the optimal indices to be truly optimal, all assumptions
used to specify the index must be true. For this reason, data were simulated
in Study One that perfectly satisfied all assumptions. In Study Two, an
actual ASVAB data set was used so that we could evaluate the properties of the
optimal and practical indices in realistic settings.

Data  generation.  The  ASVAB  AR  subtest,  the  first  of  our  two 
unidimensional  tests,  is  a  30-item,  four-option  multiple-choice  test.  A 
sample  of  N  =  2,978  examinees  was  taken  from  the  NORC  data  set  by  selecting 
every  fourth  examinee  (examinees  1,  5,  9,  ...).  The  LOGIST  (version  2B) 
computer program (Wood et al., 1976) was used to estimate three-parameter
logistic  ICCs.  OCCs  for  the  incorrect  option  (with  omitted  and  not-reached 
treated  as  a  single  incorrect  option)  were  estimated  by  means  of  Levine's 
(1985a;  1985b)  MFS  theory.  A  detailed  description  of  these  analyses  was 
presented  in  Chapter  III. 

The 15-item Paragraph Comprehension subtest and the 35-item Word
Knowledge  subtest  of  the  ASVAB  were  pooled  to  form  our  second  unidimensional 
test.  These  two  tests  correlate  .82  (Ree,  Mullins,  Mathews,  &  Massey,  1982), 
and  their  correlation  corrected  for  attenuation  is  .96.  Consequently,  fitting 
unidimensional  item  response  models  to  the  pooled,  50-item  Word  Knowledge  - 
Paragraph  Comprehension  (WKPC)  subtest  seemed  justified. 

As  with  the  AR  subtest,  LOGIST  was  used  to  estimate  ICCs,  and  MFS  was 
used  to  estimate  OCCs.  Plots  showing  estimated  curves  and  empirical 
proportions  indicated  good  fits  of  both  the  ICCs  and  OCCs  to  the  data. 

The  ICCs  and  OCCs  estimated  from  the  AR  and  WKPC  subtests  were  used  as 
the  "true"  ICCs  and  OCCs  for  the  rest  of  Study  One.  As  the  first  step  in  the 
simulation,  a  sample  of  3,000  simulated  response  patterns  was  created  and  used 
as  a  test  norming  sample.  The  ICCs  previously  estimated  were  used  to 
determine  probabilities  of  correct  responses,  and  the  MFS  OCCs  were  used  to 
determine  the  probabilities  of  incorrect  options.  Abilities  for  the  two  tests 
were  sampled  from  a  bivariate  standard  normal  distribution  with  the 
correlation  parameter  set  equal  to  .8  (the  correlations  of  WK  and  PC  with  AR 
are about .8 after correcting for unreliability; see Ree et al., 1982). Thus,
for each simulated response pattern, a vector $(\theta_1, \theta_2)$ was
sampled from a bivariate standard normal with a correlation of .8;
$\theta_1$ and the AR ICCs and OCCs were used to simulate a polychotomously
scored 30-item unidimensional test; and $\theta_2$ and the WKPC ICCs and OCCs
were used to simulate a polychotomously scored 50-item unidimensional test.
The entire response vector of 80 items was taken as the data provided by one
simulee.
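
The generation of one simulee can be sketched as follows. The sketch covers only the dichotomous (correct/incorrect) part of the simulation; the choice among incorrect options via the MFS OCCs is not reproduced. The scaling constant D = 1.7, the item parameter values in the usage lines, and all function names are assumptions made for illustration.

```python
import math
import random

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def sample_abilities(rng, rho=0.8):
    """Draw (theta1, theta2) from a bivariate standard normal with
    correlation rho, using two independent standard normal deviates."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2

def simulate_simulee(rng, items1, items2, rho=0.8):
    """Return the 0/1 responses to test 1 followed by those to test 2.
    items1, items2: lists of (a, b, c) item parameter triples."""
    theta1, theta2 = sample_abilities(rng, rho)
    u1 = [1 if rng.random() < icc_3pl(theta1, a, b, c) else 0
          for a, b, c in items1]
    u2 = [1 if rng.random() < icc_3pl(theta2, a, b, c) else 0
          for a, b, c in items2]
    return u1 + u2

# Usage with made-up item parameters (30 "AR" and 50 "WKPC" items).
rng = random.Random(12345)
items_ar = [(1.0, 0.0, 0.2)] * 30
items_wkpc = [(1.2, -0.5, 0.2)] * 50
responses = simulate_simulee(rng, items_ar, items_wkpc)
```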

The  test  norming  sample  was  then  used  to  determine  the  item  and  test 
statistics  required  to  compute  the  multi-test  practical  appropriateness 
indices based on the three-parameter logistic model (z_3, T2, T4, F1, F2).

This  entailed  two  runs  of  LOGIST  (one  for  the  simulated  AR  and  one  for  the 
simulated  WKPC)  and  two  runs  of  our  own  FORTRAN  program. 

A  normal  sample  of  4,000  response  vectors  and  16  aberrant  samples  of 
2,000  response  vectors  each  were  then  created.  The  normal  sample  was 
generated  exactly  as  was  the  test  norming  sample  (except,  of  course,  that 
different seeds were used for the random number generators). As in Chapters
II and III, the aberrant samples resulted from varying three factors: the
type of aberrance (spuriously high; spuriously low), the severity of aberrance
(mild; moderate), and the distribution from which simulated abilities were
sampled.

Eight of the aberrant samples contained spuriously high response vectors,
and the remaining eight samples contained spuriously low response vectors.
Spuriously high response patterns were created by replacing a given
percentage k of simulated responses (randomly sampled without replacement)
with correct responses for each of the two simulated unidimensional tests
separately. Spuriously low response patterns were also created by applying
the spuriously low manipulation to each of the two unidimensional tests
separately. Mildly aberrant response patterns were generated by using
k = 15% (i.e., 5 of 30 AR items and 8 of 50 WKPC items). Moderately aberrant
response patterns were created using k = 30% (i.e., 9 of 30 AR items and 15
of 50 WKPC items).
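
The spuriously high manipulation can be sketched as below for dichotomously scored responses. The rounding rule (halves rounded up) is inferred from the item counts quoted in the text, and the function names are hypothetical. The spuriously low manipulation is defined in an earlier chapter and is not sketched here.

```python
import random

def n_to_replace(n_items, k_percent):
    """Number of items to alter, rounding halves up
    (for example, 15% of 30 items gives 5)."""
    return (n_items * k_percent + 50) // 100

def make_spuriously_high(u, k_percent, rng):
    """Return a copy of the 0/1 response list u in which k_percent of the
    items, sampled without replacement, are set to correct (1)."""
    altered = list(u)
    for i in rng.sample(range(len(u)), n_to_replace(len(u), k_percent)):
        altered[i] = 1
    return altered
```

For a two-test battery the function would be applied to the AR responses and the WKPC responses separately, so that each test receives its own k percent of altered items.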

The third variable manipulated was the ability level of the aberrant
sample. A composite ability was computed for each examinee by the formula

$$\frac{\theta_1 + \theta_2}{[\mathrm{Var}(\theta_1 + \theta_2)]^{1/2}}
  = (\theta_1 + \theta_2)/1.9.$$

Notice that the composite ability has a standard normal distribution.
Composite  abilities  for  the  spuriously  high  samples  were  sampled  from  four 
parts  of  the  standard  normal  distribution:  very  low  (0th  through  9th 
percentiles),  low  (10th  through  30th  percentiles),  low  average  (31st  through 
48th  percentiles),  and  high  average  (49th  to  64th  percentiles).  Composite 
abilities  were  sampled  from  four  average  to  high  ability  strata  for  the 
spuriously  low  samples:  low  average  (31st  to  48th  percentiles),  high  average 
(49th  through  64th  percentiles),  high  (65th  through  92nd  percentiles),  and 
very  high  (93rd  percentile  and  above). 
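
The divisor 1.9 follows from Var(θ1 + θ2) = 1 + 1 + 2(.8) = 3.6, whose square root is about 1.897. The sketch below computes the composite and assigns it to one of the ability strata named above. Treating the percentile boundaries as cumulative proportions with cut points at 10, 31, 49, 65, and 93 is an assumption, as are the function names.

```python
import math

def composite_ability(theta1, theta2, rho=0.8):
    """Sum of the two abilities divided by its standard deviation."""
    return (theta1 + theta2) / math.sqrt(2.0 + 2.0 * rho)

def percentile(x):
    """Standard normal cumulative probability, expressed in percent."""
    return 50.0 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Upper cut points (in percent) and labels; anything above is "very high".
STRATA = [(10.0, "very low"), (31.0, "low"), (49.0, "low average"),
          (65.0, "high average"), (93.0, "high")]

def stratum(composite):
    """Label the ability stratum that contains a composite ability."""
    pct = percentile(composite)
    for upper, label in STRATA:
        if pct < upper:
            return label
    return "very high"
```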


Analysis. The practical appropriateness indices were computed for the
4,000 response vectors in the normal sample. The item and test statistics
estimated from the test norming sample were used to compute all but one
appropriateness index. The one exception was the standardized $\ell_0$ index
computed from the polychotomously scored item responses, denoted z_p. It was
computed using the true OCCs and ICCs. This allowed us to bypass estimation
of OCCs from the test norming sample and provided a significant reduction in
computing time. (Despite the advantage gained by being computed from true
rather than estimated OCCs, it is shown below that z_p fell short of some
other indices. Therefore, the advantage given to z_p was of little practical
consequence.)


One  non-IRT  index  was  also  computed:  the  Deliberate  Failure  Key  (DFK), 
which  was  provided  by  the  AFHRL. 


Optimal appropriateness indices were computed (using the true OCCs and
ICCs) for the normal sample for four aberrant conditions: 15% spuriously
high, 30% spuriously high, 15% spuriously low, and 30% spuriously low. For
each of these conditions two optimal appropriateness indices were computed.
The first, LR_p, is the optimal index for polychotomous scoring of the item
responses. The second index, LR_3, results from using only the information in
the dichotomously scored item responses. Thus, LR_3 is based on a submodel for
the polychotomous data in which all the incorrect responses are grouped
together.

The  practical  appropriateness  indices  were  computed  for  each  of  the  16 
aberrant  samples.  In  addition,  the  three-parameter  logistic  and  polychotomous 
model  15%  spuriously  high  optimal  indices  were  computed  for  the  four  samples 
with  this  form  of  aberrance,  the  30%  spuriously  high  optimal  indices  were 
computed  for  the  four  samples  with  this  form  of  aberrance,  etc. 

Results.  The results for the spuriously high conditions are given in
Tables 18 through 21, and results for the spuriously low conditions are given
in Tables 22 through 25. These tables show that the multi-test extensions
provide sizable gains in detection rates. Table 18, which presents the
results for the lowest ability range, illustrates this point. At a 1% false
alarm rate for the 15% spuriously high condition, the polychotomous optimal
index LRp detected 22% of the aberrant response patterns if only the AR item
responses were used, 37% from the WKPC item responses, and 55% from the
combined 80 items. In Chapter II, we obtained a 50% detection rate under
these conditions (15% spuriously high, 0 to 9th percentile ability range) for
an 85-item unidimensional test. In fact, our polychotomous model, multi-test
optimal index provided detection rates that are very similar to the rates
obtained in Chapter II: At false alarm rates of 3%, 5%, and 10%, our hit rates
for the 15% spuriously high treatment were 67%, 72%, and 78%, respectively;
the hit rates in Chapter II were 64%, 70%, and 77%. For the 30% spuriously
high treatment at false alarm rates of 1%, 3%, 5%, and 10%, the hit rates were
88%, 92%, 94%, and 95%, respectively; the hit rates in Chapter II were 93%,
95%, 97%, and 98%.
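Each tabled detection rate can be read as one point on an ROC curve. The following is a minimal sketch of how such a point might be computed; the function name, the simulated index scores, and the convention that larger index values signal aberrance are assumptions made for illustration and are not taken from the report. The cutoff is set from the normal sample at the chosen false alarm rate, and the hit rate is the share of the aberrant sample beyond that cutoff.

```python
import numpy as np

def hit_rate_at_false_alarm(normal_scores, aberrant_scores, false_alarm_rate):
    """Proportion of aberrant scores above the cutoff that flags
    `false_alarm_rate` of the normal sample (larger score = more aberrant)."""
    normal_scores = np.asarray(normal_scores, dtype=float)
    aberrant_scores = np.asarray(aberrant_scores, dtype=float)
    # Cutoff chosen so that the stated share of normals exceeds it.
    cutoff = np.quantile(normal_scores, 1.0 - false_alarm_rate)
    return float(np.mean(aberrant_scores > cutoff))

# Hypothetical index scores: 4000 normals and 1000 aberrant response vectors.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=4000)
aberrant = rng.normal(2.0, 1.0, size=1000)

for fa in (0.001, 0.01, 0.03, 0.05, 0.10):
    print(fa, round(hit_rate_at_false_alarm(normal, aberrant, fa), 2))
```

For an index on which small values signal aberrance, the same sketch applies after negating the scores.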

Comparisons of Tables 18 through 25 with our earlier results reveal that
the polychotomous model, multi-test optimal indices provide detection rates
that are generally similar to the rates provided by the polychotomous model
optimal indices for the long unidimensional test. The differences that occur
seem to be due more to the differences in the characteristics of the item
pools (the items in the earlier study tended to be more difficult than the
items used here) than to the dimensionality of the latent trait space (i.e.,
use of the multi-test extensions).

The hit rates for the multi-test practical appropriateness indices are
less similar to the hit rates of practical indices on long unidimensional
tests. The differences are particularly obvious for the spuriously high
conditions. Perhaps the best way to illustrate the differences is to compare
the detection rate of the best practical index to the detection rate of the
optimal index. At a 1% false positive rate for the 30% spuriously high
treatment in Table 18, this ratio equals .75 for z3 divided by .88 for LRp;
namely, .75/.88 = .85. The corresponding ratio was .98 in Chapter II (.91 for
T2 divided by .93 for LRp). For the next higher ability range (10th through
30th percentiles), the ratio is .58 in Table 19; the corresponding ratio from
Chapter II is .91. Finally, the ratio for the low average ability range from
Table 20 is .47, and the ratio from Chapter II is .73.
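The ratio used above is simple arithmetic, and a short sketch makes it reproducible. The function name and dictionary labels are illustrative assumptions; the four detection rates are the ones quoted in the text for the lowest ability range at a 1% false alarm rate.

```python
def relative_efficiency(practical_rate, optimal_rate):
    """Detection rate of a practical index as a fraction of the optimal rate."""
    if optimal_rate <= 0:
        raise ValueError("optimal_rate must be positive")
    return practical_rate / optimal_rate

# (best practical rate, optimal rate), 30% spuriously high treatment.
comparisons = {
    "multi-test, Table 18 (z3 vs. LRp)": (0.75, 0.88),
    "long unidimensional test, Chapter II (T2 vs. LRp)": (0.91, 0.93),
}

for label, (practical, optimal) in comparisons.items():
    print(f"{label}: {relative_efficiency(practical, optimal):.2f}")
# prints 0.85 for the multi-test case and 0.98 for the Chapter II case
```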
Table 18.  Selected ROC Points for Spuriously High
Response Patterns Generated from the 00-09% Ability Range

False
alarm             Proportion detected by
rate    Test    LRp   LR3   zp    z3    F1    F2    T2    T4

              15% Spuriously High Treatment

.001    AR      06    03    00    02    00    00    01    01
        WKPC    19    07    00    05    00    01    01    03
        MT      26    15    01    12    00    01    04    04

.01     AR      22    20    04    13    02    12    06    07
        WKPC    37    22    04    24    00    10    07    09
        MT      55    37    07    --    00    18    15    14

.03     AR      38    31    09    25    04    24    19    16
        WKPC    49    35    10    41    00    22    17    17
        MT      67    48    14    56    03    37    23    25

.05     AR      46    39    14    33    13    32    26    21
        WKPC    57    41    15    50    00    30    24    23
        MT      72    53    19    65    07    49    39    32

.10     AR      55    50    25    50    35    49    42    33
        WKPC    66    48    25    63    13    47    37    35
        MT      78    62    28    --    40    66    56    --

              30% Spuriously High Treatment

.001    AR      29    21    00    12    00    00    10    06
        WKPC    42    19    00    21    00    01    14    17
        MT      74    44    02    44    00    04    34    31

.01     AR      52    42    07    37    01    24    28    27
        WKPC    68    41    07    50    00    18    33    34
        MT      88    69    13    75    00    44    60    56

.03     AR      66    57    17    52    10    42    48    41
        WKPC    79    53    15    67    00    39    50    48
        MT      92    77    25    86    00    67    77    71

.05     AR      72    64    25    62    26    52    58    --
        WKPC    82    59    22    76    03    50    60    55
        MT      94    80    33    90    21    79    85    78

.10     AR      79    71    39    76    52    69    73    65
        WKPC    86    64    34    84    34    70    73    69
        MT      95    84    47    95    67    89    91    88

Note.  -- marks an entry that is illegible or missing in the source copy.
In the two rows with a missing entry, the surviving values are listed in
their printed order and the gap is placed in the last column.

Table 19.  Selected ROC Points for Spuriously High
Response Patterns Generated from the 10-30% Ability Range


Table 20.  Selected ROC Points for Spuriously High
Response Patterns Generated from the 31-48% Ability Range


Table 21.  Selected ROC Points for Spuriously High
Response Patterns Generated from the 49-64% Ability Range


Table 24.  Selected ROC Points for Spuriously Low
Response Patterns Generated from the 65-92% Ability Range

False
alarm             Proportion detected by
rate    Test    LRp   LR3   zp    z3    F1    F2    T2    T4    DFK

              15% Spuriously Low Treatment

.001    AR      14    04    00    02    00    00    02    01
        WKPC    42    27    03    05    00    00    18    14
        MT      55    27    08    12    00    00    19    14    00

.01     AR      34    19    08    13    13    05    09    09
        WKPC    66    49    22    25    16    09    35    30
        MT      74    56    30    34    22    14    --    34    00

.03     AR      44    31    19    22    28    14    21    19
        WKPC    73    61    42    45    43    25    50    42
        MT      81    69    52    54    53    31    53    47    00

.05     AR      49    38    26    29    36    20    27    25
        WKPC    75    65    54    55    58    35    58    49
        MT      84    73    --    62    63    43    62    56    00

.10     AR      56    46    42    42    47    33    42    36
        WKPC    80    72    69    69    74    56    68    63
        MT      87    79    76    73    79    59    73    69    00

              30% Spuriously Low Treatment

.001    AR      28    10    01    11    00    00    11    07
        WKPC    61    42    10    22    00    01    39    37
        MT      81    63    29    41    00    02    48    43    00

.01     AR      48    31    22    30    12    16    26    24
        WKPC    81    64    46    55    02    25    57    57
        MT      89    76    67    69    07    39    66    66    00

.03     AR      61    45    38    42    31    28    40    38
        WKPC    85    75    68    71    30    48    69    67
        MT      92    84    82    82    48    60    78    77    00

.05     AR      67    50    47    48    44    37    46    45
        WKPC    88    79    79    78    54    60    75    73
        MT      94    87    --    86    65    72    83    83    00

.10     AR      74    60    63    61    57    52    --    56
        WKPC    93    85    89    --    79    76    82    82
        MT      96    91    --    92    85    --    90    89    --

Note.  -- marks an entry that is illegible in the source copy.  DFK was
tabled only for the multi-test (MT) rows.
Table 25.  Selected ROC Points for Spuriously Low
Response Patterns Generated from the 93-100% Ability Range


Discussion.  The comparisons of the detection rates of the multi-test
practical indices to rates for LRp show an important difference between
unidimensional Appropriateness Measurement and multidimensional
Appropriateness Measurement. Specifically, z3, T2, and T4 efficiently
detected spuriously high response patterns on the long unidimensional SAT-V.
Tables 18 through 21 show that we did not replicate this finding with the
short AR and WKPC tests: There are substantial differences in hit rates
between practical and optimal multi-test appropriateness indices. This
finding provides a motivation for seeking better practical appropriateness
indices.


Study  Two:  Actual  ASVAB  Data 


Purpose.  Do the results obtained for simulated ASVAB data generalize to
actual ASVAB data? In previous research (Drasgow et al., 1985; Levine &
Drasgow, 1982), we found that unidimensional ℓ0 appropriateness indices
provided similar rates of detection with real and simulated data. Will we
obtain similar results for the multi-test extensions of the standardized ℓ0
index and the other appropriateness indices?

For an optimal appropriateness index to be truly optimal, ICCs (and OCCs
if the analysis is polychotomous) must be known and must fit the data, tests
assumed to be unidimensional must be truly unidimensional, the correlation
between ability on test one and ability on test two must be known, and the
ability density must be known. We violated all of these conditions in Study
Two. To what extent will detection rates for optimal indices be degraded?

Data sets.  The NORC sample provided the data base for Study Two. The
test norming sample consisted of responses of the N = 2,978 NORC examinees
analyzed in the first phase of Study One. The AR and WKPC ICCs and OCCs
estimated from this sample were used for all analyses in Study Two. Also, the
statistics needed for the T2 and T4 indices were obtained from this sample.
Finally, a standardized residual (SR) measure was created by first regressing
the total number-right score from the AR and WKPC subtests on the Math
Knowledge (MK) and General Science (GS) subtests of the ASVAB,

    Predicted (AR + WKPC) = B1 + B2 MK + B3 GS
                          = 7.98 + 1.20 MK + 1.88 GS,

and then standardizing the residual

    AR + WKPC - Predicted (AR + WKPC)

as described by Cook and Weisberg (1982). The correlation between MK and AR,
after correcting for attenuation, is .88; the corrected correlation between GS
and WK is .94; and the corrected correlation between GS and PC is .90 (Ree et
al., 1982). Large positive values of SR were used to indicate spuriously high
test scores, and large negative values of SR were taken to indicate spuriously
low scores.
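The SR construction above can be sketched in a few lines. The sketch below uses simulated scores (not NORC data) and one common standardization, the internally studentized residual e / (s * sqrt(1 - h)); whether this is the exact form taken from Cook and Weisberg (1982) in the report is an assumption, as are the function and variable names.

```python
import numpy as np

def standardized_residuals(criterion, predictors):
    """Ordinary least squares fit followed by internally studentized residuals.

    Returns (coefficients, standardized residuals); coefficients[0] is the
    intercept.
    """
    y = np.asarray(criterion, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(predictors, dtype=float)])
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefficients
    n_obs, n_params = X.shape
    residual_variance = residuals @ residuals / (n_obs - n_params)
    # Leverage h_ii = x_i' (X'X)^-1 x_i, the diagonal of the hat matrix.
    leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    return coefficients, residuals / np.sqrt(residual_variance * (1.0 - leverage))

# Simulated examinees; the generating weights echo the fitted equation above.
rng = np.random.default_rng(1)
mk = rng.integers(0, 26, size=2000)   # hypothetical MK number-right scores
gs = rng.integers(0, 26, size=2000)   # hypothetical GS number-right scores
ar_wkpc = 7.98 + 1.20 * mk + 1.88 * gs + rng.normal(0.0, 5.0, size=2000)

coefficients, sr = standardized_residuals(ar_wkpc, np.column_stack([mk, gs]))
print(np.round(coefficients, 2))
```

Large positive entries of `sr` would then be read as candidate spuriously high scores and large negative entries as candidate spuriously low scores.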
A normal sample of 2,716 response vectors was formed by selecting every
fourth examinee (2, 6, 10, ...) from the NORC sample, and then deleting the
data from the 262 examinees who failed to answer at least 77% of the items on
both the AR and the WKPC subtests. The requirement that examinees answer at
least 77% of the items is based on the Drasgow et al. (1985) conclusion that
test scores of individuals who answer less than 77% of the test are very
likely to be invalid measures of ability.

The remaining examinees from the NORC sample (examinees 3, 4, 7, 8, 11,
12, ...) were used to form six more samples. These samples were created by
first determining the frequency distribution of total score across both the AR
and WKPC subtests (i.e., AR + WKPC); sorting into groups on the basis of the
percentiles used for the AFQT Categories; and finally, removing examinees who
answered fewer than 77% of the items on either the AR or WKPC subtests. Score
ranges and sample sizes for the six groups were:

                    AR + WKPC       Sample
    Sample          Score Range     Size

    very high       74 to 80         494
    high            59 to 73        1537
    high average    50 to 58         941
    low average     39 to 49         959
    low             24 to 38        1155
    very low         0 to 23         342
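The sampling scheme described above can be sketched as follows. This is an illustration under stated assumptions: examinees are numbered from 1 in file order, omitted items are coded as NaN, and an examinee is retained only when at least 77% of the items on each of the two subtests were answered. The function names are hypothetical.

```python
import numpy as np

def split_examinee_ids(n_examinees):
    """Return (normal-sample ids, ability-group ids) for ids 1..n_examinees."""
    ids = np.arange(1, n_examinees + 1)
    normal_ids = ids[ids % 4 == 2]                            # 2, 6, 10, ...
    ability_group_ids = ids[(ids % 4 == 3) | (ids % 4 == 0)]  # 3, 4, 7, 8, ...
    return normal_ids, ability_group_ids

def answered_enough(ar_responses, wkpc_responses, min_fraction=0.77):
    """Boolean mask of examinees who answered enough items on both subtests.

    Each argument is an (examinees x items) array with NaN for omitted items.
    """
    ar_answered = np.mean(~np.isnan(ar_responses), axis=1)
    wkpc_answered = np.mean(~np.isnan(wkpc_responses), axis=1)
    return (ar_answered >= min_fraction) & (wkpc_answered >= min_fraction)

normal_ids, ability_group_ids = split_examinee_ids(12)
print(normal_ids, ability_group_ids)
```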

Aberrant samples were formed exactly as in Study One. Thus, the 15% and
30% spuriously high treatments were applied to the four lowest ability groups,
and the 15% and 30% spuriously low treatments were applied to the four
highest ability groups.

Analysis.  Appropriateness indices were computed as in Study One, with
the main exception that optimal indices were computed with ICCs and OCCs
estimated from the test norming sample. The correlation between θ1 and θ2 was
assumed to be .8, and the ability density was assumed to be the standard
normal truncated to (-5.0, 3.5). Appropriateness indices were computed for
the six samples stratified on ability, before the aberrance treatments as well
as after each aberrance treatment.
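The assumed ability density can be approximated on a grid. The sketch below is an illustration under assumptions not stated in the report: both abilities are truncated to the same interval, the density is represented by normalized weights on an equally spaced grid, and the grid size of 61 points per axis is arbitrary.

```python
import numpy as np

def truncated_bivariate_normal_grid(rho=0.8, lower=-5.0, upper=3.5, n_points=61):
    """Grid approximation to a standard bivariate normal density with
    correlation `rho`, truncated to (lower, upper) on both axes.

    Returns (theta, weights) where weights[i, j] is the normalized mass at
    (theta[i], theta[j]).
    """
    theta = np.linspace(lower, upper, n_points)
    t1, t2 = np.meshgrid(theta, theta, indexing="ij")
    quad_form = (t1**2 - 2.0 * rho * t1 * t2 + t2**2) / (1.0 - rho**2)
    weights = np.exp(-0.5 * quad_form)
    weights /= weights.sum()   # renormalize after truncation
    return theta, weights

theta, weights = truncated_bivariate_normal_grid()
print(weights.shape, round(float(weights.sum()), 6))
```

Marginal likelihoods of a response pattern can then be approximated as weighted sums over this grid.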

Index standardization.  Although each practical appropriateness index
(except DFK) was standardized, the expressions for the conditional
expectations and variances of the indices were obtained using the assumption
that θ1 and θ2 were known. Of course, in practice, they are unknown;
therefore, it is important to investigate the conditional distributions of the
appropriateness indices for normal examinees.


The standardizations of the practical indices can be determined from
Figure 7. This figure presents ROC curves for seven practical appropriateness
indices: zp, z3, F1, F2, T2, T4, and SR. Abscissa values in all cases were
determined from the normal sample of 2,716 examinees. For the top row of the
figure, ordinate values were based on the responses of the 342 examinees in
the very low ability range prior to any aberrance manipulation (i.e., this
sample was simply a normal, low ability group). Ordinate values for the
middle row of the figure were based on the low average sample, and the bottom
row was determined from the very high ability sample. Response patterns were
presumably normal for these two samples as well (we had not applied any
aberrance treatment). Only the lower left quarter of each ROC curve is shown,
in order to conserve space and because we are primarily concerned with an
index's standardization for low misclassification rates. Results for the
other three ability ranges are not shown because they were consistent with the
trends that are apparent in Figure 7.

In Figure 7, it is clear that zp, SR, and F1 are not consistently well
standardized; z3 is reasonably well standardized across ability levels,
although its performance for the highest ability level is somewhat
disappointing; and F2 is fairly well standardized across ability levels. The
most surprising results are the very accurate standardizations of the
multi-test extensions of T2 and T4. Their standardizations were not very good
for the long unidimensional test studied in Chapter II; here, their
standardizations are excellent except, perhaps, for the highest ability group.

Detection  of  aberrant  response  patterns.  Tables  26  through  33  present 
the  detection  rates  for  the  multi-test  appropriateness  indices  when  they  are 
applied  to  actual  ASVAB  data.  Comparing  the  results  for  the  spuriously  high 
conditions  for  real  data  (Tables  26  through  29)  to  the  results  for  simulation 
data  (Tables  18  through  21)  reveals  generally  similar  detection  rates.  The 
detection  rates  for  the  polychotomous  model  optimal  index  LRp  tended  to  be 

moderately  decreased  for  the  actual  ASVAB  data,  but  detection  rates  for  the 
dichotomous  model  appropriateness  indices  were  relatively  unchanged. 

Of the practical appropriateness indices, z3 is clearly the most
effective for the lowest ability range. The T2 and T4 indices had detection
rates comparable to z3 in the 10% to 30% ability range and appear slightly
superior for the low average and high average ability ranges. The other five
practical appropriateness indices (zp, F1, F2, SR, and DFK) all had detection
rates far lower than z3, T2, and T4.

Although the detection rates for the spuriously high conditions are
similar across the simulated and real data sets, there is an important
difference: Both the normal and the aberrant groups for the actual ASVAB data
sets had generally more extreme index scores. For example, 1.6% of the 4,000
simulated normals from Study One had z3 scores less than -2.0, and 11.4% had
z3 scores less than -1.0. For the 2,716 NORC examinees taken as the normal
group, the corresponding rates were 3.^% and 16.2%. This trend was also
apparent for T2, T4, and the three-parameter logistic optimal index. For
example, LR3 had 4.2% and 12.9% of its values greater than 5 and 2,
respectively, for the NORC normals, versus only 1.8% and 7.7% for the Study
One simulated normals.
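As a point of reference for these tail rates, the sketch below computes the proportions that would fall below -2.0 and -1.0 if an index were exactly standard normal for normal examinees. That is an idealization introduced here for comparison only; the report does not claim it. Only the observed rates that are legible in the text are included.

```python
import math

def standard_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Observed proportions of normal examinees with z3 below each cutoff.
observed = {
    -2.0: {"Study One simulated normals": 0.016},
    -1.0: {"Study One simulated normals": 0.114, "NORC normals": 0.162},
}

for cutoff, rates in observed.items():
    nominal = standard_normal_cdf(cutoff)
    print(f"cutoff {cutoff}: nominal {nominal:.3f}")
    for label, rate in rates.items():
        print(f"  {label}: observed {rate:.3f}")
```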
Table 26.  Selected ROC Points for Spuriously High Response Patterns
Created from NORC Examinees in the 00-09% Ability Range


Table 28.  Selected ROC Points for Spuriously High Response Patterns
Created from NORC Examinees in the 31-48% Ability Range

Table 30.  Selected ROC Points for Spuriously Low Response Patterns
Created from NORC Examinees in the 31-48% Ability Range

False
alarm             Proportion detected by
rate    Test    LRp   LR3   zp    z3    F1    F2    T2    T4    SR    DFK

              15% Spuriously Low Treatment

.001    AR      00    00    01    01    00    00    00    00
        WKPC    01    00    00    01    00    00    01    01
        MT      02    00    00    01    00    00    --    00    00    00

.01     AR      05    01    02    02    00    01    01    01
        WKPC    09    02    01    07    00    01    06    05
        MT      11    04    02    05    00    01    05    03    01    00

.03     AR      19    05    08    08    01    05    06    06
        WKPC    22    15    03    16    00    04    12    14
        MT      26    17    06    25    00    03    09    10    06    00

.05     AR      19    11    13    12    03    09    09    07
        WKPC    31    27    09    20    01    05    16    18
        MT      35    26    14    20    02    08    15    16    10    00

.10     AR      30    23    23    21    13    16    16    16
        WKPC    46    41    26    35    13    14    26    28
        MT      52    45    32    33    14    16    27    27    18    04

              30% Spuriously Low Treatment

.001    AR      01    01    01    01    00    00    01    00
        WKPC    07    01    01    02    00    00    03    03
        MT      03    03    00    03    00    00    02    01    01    00

.01     AR      11    04    04    05    00    03    03    04
        WKPC    24    21    02    19    00    02    11    16
        MT      26    26    07    26    00    02    10    12    06    00

.03     AR      25    16    13    12    01    09    09    10
        WKPC    41    40    16    31    00    11    20    28
        MT      50    43    23    30    00    10    17    22    15    01

.05     AR      31    21    20    18    06    14    14    13
        WKPC    49    48    33    40    03    15    27    32
        MT      62    54    40    --    05    18    25    30    23    01

.10     AR      48    39    33    31    19    22    22    23
        WKPC    72    62    55    56    21    30    39    49
        MT      81    71    --    57    25    33    42    46    33    15

Note.  -- marks an entry that is illegible in the source copy.  SR and DFK
were tabled only for the multi-test (MT) rows.
Table 31.  Selected ROC Points for Spuriously Low Response Patterns
Created from NORC Examinees in the 49-64% Ability Range


Table 33.  Selected ROC Points for Spuriously Low Response Patterns
Created from NORC Examinees in the 93-100% Ability Range

In sum, the distributions of index scores for the NORC normals had more
extreme values than did the distribution for the Study One simulated normals.
Detection rates of spuriously high examinees did not significantly decrease,
however, because there were comparable shifts in the distributions of index
scores for the aberrant samples.

The results for the spuriously low conditions are shown in Tables 30
through 33. The detection rates for LRp are somewhat lower in these tables
than the comparable rates (shown in Tables 22 through 25) obtained with
simulated data. The rates for LR3 and z3 remained basically unchanged. The
detection rates for LRp decreased for two reasons. First, as noted above, the
distributions of index scores for the NORC normals shifted toward more extreme
values. Second, the distributions of LRp scores for the spuriously low
conditions were essentially unchanged. Thus, the "signal" was unchanged but
the "noise" increased; therefore, the signal-to-noise ratio decreased.

Although the rates of detection of spuriously low response patterns were
lower for LRp with the NORC data than with the simulated data, some impressive
detection rates were nonetheless obtained. For example, LRp detected 85% and
62% of the 15% spuriously low examinees for the very high and high ability
ranges at a 1% false alarm rate. The corresponding rates were 96% and 76% for
the 30% spuriously low treatment.


Discussion 

The transition from simulated data in Study One to real data in Study Two
was very successful for the three-parameter logistic appropriateness indices.
Although detection rates for LRp tended to be lower with the real data, some
impressive results were nonetheless obtained. For example, 82% of the NORC
examinees in the lowest ability range who were subjected to the 30% spuriously
high treatment could be detected by the optimal LRp index when the false alarm
rate was 1%; 75% could be detected by z3; and 64% could be detected by T2.

In contrast to the high detection rates obtained by the IRT
appropriateness indices, very low detection rates were obtained by the SR
measure. For example, only 4% of the very low ability, 30% spuriously high
response patterns were identified by SR at a 1% false alarm rate. The results
for SR are, in fact, even worse than they appear: SR is based on 30 AR items,
50 WKPC items, 25 MK items, and 25 GS items. Thus, a total of 130 items were
used for SR. The IRT appropriateness indices used only 80 items; considerably
higher detection rates would be expected if all 130 items were used.

The  transition  from  simulated  to  real  data  was  less  successful  for  the 
LRp  index  in  the  spuriously  low  conditions.  Detection  rates  were  lower  for 

the  real  data  because  the  distributions  of  index  scores  for  normal  NORC 
examinees  were  shifted  toward  more  extreme  values.  The  distributions  for 
spuriously  low  response  vectors,  in  contrast  to  the  spuriously  high  response 
patterns,  were  not  similarly  shifted. 
One hypothesis about the differences between the results for the real and
simulated data sets concerns the distributions of ability. For simulated
data, abilities were distributed as bivariate normal, with zero means, unit
variances, and a correlation of .8. The distributions of ability for the real
data were clearly nonnormal: A second mode of the density was evident at
θ = -5. This second mode is clearly shown in Figure 4.

Why would a second mode appear at a very low ability? Since the NORC
examinees were not a sample of actual recruits, it is possible that some were
poorly motivated to do their best. Indeed, some examinees omitted every item
on entire tests. Thus, we are led to hypothesize that the bivariate ability
distribution contains a nontrivial point mass corresponding to examinees who
were very poorly motivated. An optimal index for spuriously low examinees
based on the estimated distribution of ability should lead to increased rates
of detection.


V.  DISCUSSION 


In the present effort, several new appropriateness indices were
developed. These indices, as well as a number of appropriateness indices
previously developed, were carefully evaluated in a series of studies. By
comparing detection rates to the rates obtained by the optimal appropriateness
indices developed by Levine and Drasgow (1984; 1987), we were able to
determine the effectiveness of all of the indices in an absolute sense.
Detection rates for the three best practical indices (z3, T2, and T4) are
presented in Figure 8 as percentages of the optimal detection rate (at a 1%
error rate).

A major result of this effort is the finding that a few of the practical
appropriateness indices (namely, z3, T2, and T4) effectively detect aberrant
response patterns across a fairly wide range of conditions. Multi-test
extensions of these indices were developed for situations in which examinees
complete a battery of short unidimensional tests. The multi-test extensions
of z3, T2, and T4 were found to provide high rates of detection of aberrant
response patterns when simulated and actual ASVAB data were used. Thus, it
was concluded that these indices, which are all based on IRT, are strong
candidates for use in operational settings.

The standardized residual (SR) index provides another approach to the
detection of inappropriate response patterns. Unlike IRT indices, which
analyze the internal consistency of a response pattern, the SR index [...]
external information such as scores on other tests. This external [...]
used to predict scores on the tests of interest (e.g., AFQT subtests) [...]
large errors of prediction are taken as indicating that test scores [...]
aberrant.

The SR index, in contrast to the IRT indices, was found [...]
under all conditions. It therefore seems to be a weak [...]
an important idea. IRT provides a much more precise and [...]
detecting aberrant response patterns than the classical [...]
concepts used by SR.


Figure 8.  Detection rates of z3, T2, and T4, by AFQT Category, expressed
           as proportions of the rate of the optimal index at a 1% false
           alarm rate. (Rates are not plotted when the optimal index
           detected less than 10% of the aberrant sample.)


How effective are the best practical appropriateness indices in relation
to optimal indices? The practical appropriateness indices are much better
than non-IRT alternatives such as the SR measure, but sometimes fall short of
optimal. Therefore, it seems that operational use of z3, T2, and T4 is
justified. Moreover, a program of research designed to develop and validate
better practical appropriateness indices is also warranted. This conclusion
was reached because z3, T2, and T4 decisively outperformed SR and other IRT
indices, but fell short of optimality in some cases.

The  optimal  appropriateness  indices  used  in  the  present  research  seem  to 
be  simultaneously  too  specific  and  not  specific  enough  to  use  as  practical 
appropriateness  indices.  They  are  too  specific  in  that  different  optimal 
indices  must  be  computed  for  differing  percentages  of  spuriously  high  and 
spuriously  low  responses.  They  are  not  specific  enough  in  that  ability  is 
assumed  to  be  distributed  as  standard  normal  in  both  the  normal  and  aberrant 
groups.  More  specific  assumptions  about  ability  distributions,  particularly 
for  the  aberrant  group,  would  seem  to  be  desirable  in  many  situations. 

Therefore,  it  is  important  to  develop  a  "second  generation"  of  optimal 
indices  that  could  be  used  in  practice  to  test  hypotheses  that  are  very 
general  in  some  ways  but  very  specific  in  others.  Examples  of  some  hypotheses 
that  may  be  important  to  test  include  the  following: 

1.  Was  a  response  vector  generated  by  a  normal  examinee  or  was  it 
generated  by  a  very  low  ability  (AFQT  Category  V)  examinee  who  was  cheating  on 
10  to  30  items?  Low  ability  cheaters  would  be  expected  to  have  high  rates  of 
attrition  in  training  and  generally  poor  on-the-job  performance,  both  of  which 
are  very  costly. 

2.  Was  a  response  vector  generated  by  a  high  average  (AFQT  Category  3A) 
examinee  or  a  low  average  (AFQT  Category  3B)  examinee  who  was  cheating  on  a 
moderate  number  of  items?  Recruitment  bonuses  for  AFQT  Category  3A  scores  may 
provide  a  powerful  incentive  for  examinees  slightly  below  average  to  cheat. 

3.  Suppose  it  is  known  that  part  or  all  of  one  subtest  is  no  longer 
secure.  Was  a  response  pattern  generated  by  a  normal  examinee  or  by  an 
examinee  who  had  prior  access  to  the  compromised  items? 

4.  Are  members  of  an  ethnic  minority  penalized  because  a  test  was 
developed  and  standardized  using  majority  group  members  as  examinees?  The 
likelihood  of  the  response  pattern  could  be  computed  using  item  parameters 
estimated  from  a  majority  group  sample  and  from  a  minority  group  sample.  If 
the  test  is  fair,  then  even  the  optimal  appropriateness  index  would  be  unable 
to  effectively  classify  majority  and  minority  group  members.  In  this  way,  the 
methodology  of  optimal  indices  is  applied  to  determine  the  extent  to  which 
ethnicity  can  be  determined  from  item  response  patterns. 
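The likelihood comparison in the fourth hypothesis can be illustrated with a small sketch. It is not the report's implementation: it assumes dichotomous scoring under a three-parameter logistic model with scaling constant 1.7, a standard normal ability density approximated on a grid, and entirely hypothetical item parameters for the two groups.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICCs; returns an array (grid points x items)."""
    z = 1.7 * a * (theta[:, None] - b)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def marginal_likelihood(responses, a, b, c, theta, weights):
    """Likelihood of a 0/1 response vector, averaged over the ability grid."""
    p = icc_3pl(theta, a, b, c)
    pattern_likelihood = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    return float(np.sum(weights * pattern_likelihood))

def likelihood_ratio(responses, params_alt, params_null, theta, weights):
    """Marginal likelihood under the alternative divided by that under the null."""
    return (marginal_likelihood(responses, *params_alt, theta, weights)
            / marginal_likelihood(responses, *params_null, theta, weights))

# Ability grid with normalized standard normal weights (an assumption).
theta = np.linspace(-4.0, 4.0, 81)
weights = np.exp(-0.5 * theta**2)
weights /= weights.sum()

# Hypothetical item parameters for five items in each group.
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
c = np.full(5, 0.2)
params_majority = (a, b, c)
params_minority = (a, b + 0.5, c)   # same items, assumed harder in this group

responses = np.array([1, 1, 0, 1, 0])
print(likelihood_ratio(responses, params_minority, params_majority, theta, weights))
```

If the two parameter sets were identical, as a fair test would imply, the ratio would equal 1 for every response pattern and could not separate the groups.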

Refinements  in  optimal  indices  would  enable  very  powerful  detection  of 
aberrant  response  patterns.  For  example,  suppose  we  suspect  that  a  very  low 
ability  examinee  has  been  given  answers  to  a  moderate  number  of  items  on  the 
AR,  WK,  and  PC  subtests  in  order  to  obtain  an  AFQT  score  that  qualifies 
him/her  for  a  bonus.  Furthermore,  suppose  that  there  was  no  cheating  on  the 
non-AFQT subtests. Then we could test the hypothesis that the examinee was
normal against the hypothesis that a low ability examinee cheated on 20 to 30
items on the AR, WK, and PC subtests and cheated on 0 items on the MK and GS
tests. Examinees who are aberrant in this particular way should be clearly
identifiable.

A  significant  part  of  the  theory  necessary  for  more  sophisticated  optimal 
indices  has  already  been  developed  by  Levine  and  Drasgow  (1984;  1987). 
Nonetheless,  a  considerable  amount  of  work  is  necessary  to  transform  their 
theoretical  notions,  which  were  developed  in  the  context  of  a  unidimensional 
latent  trait  space,  into  methods  that  can  be  used  to  test  the  aberrance 
hypotheses  listed  above. 

It  may  seem  that  computing  second-generation  optimal  indices  would  be 
extremely  burdensome.  It  is  true  that  extensive  calculations  would  be 
necessary,  but  the  recursive  methods  described  by  Levine  and  Drasgow  (1984;  1987) 
and  the  quadratic  approximation  and  multi-test  generalizations  developed  here 
considerably  reduce  the  computing  load.  Furthermore,  the  rapid  advances  in 
Levine's  (1985a;  1985b)  MFS  theory  allow  algebraic  simplifications  and 
eliminate  the  need  for  arbitrary  assumptions  about  the  ability  density.  In 
particular,  MFS  now  permits  one  to  bypass  the  quadratic  approximation  used  in 
Chapter  IV  and  relax  the  assumption  of  multivariate  normal  abilities. 
Multidimensional  extensions  of  Levine's  theory  are  being  developed  to  estimate 
the  joint  distribution  of  several  abilities. 

Finally,  there  are  two  important  substantive  questions  about 
Appropriateness  Measurement  that  need  to  be  addressed.  First,  the  ability 
densities  estimated  from  the  NORC  sample  depart  significantly  from  a  normal 
density.  This  has  led  us  to  reconsider  the  way  in  which  we  compute  optimal 
indices.  However,  the  NORC  sample  is  not  a  sample  of  individuals  who  are 
actually  trying  to  enlist  in  the  military.  Would  our  results  concerning 
ability  densities  be  replicated  if  data  from  actual  recruits  were  used?  Or 
would  the  results  be  more  similar  to  our  studies  with  SAT-V  data? 

The  second  substantive  question  concerns  the  distributions  of 
appropriateness  index  scores  in  samples  of  women  and  ethnic  minorities. 
Finding  similar  distributions  across  all  relevant  groups  would  support  the 
view  that  standardized  tests  in  general,  and  the  ASVAB  in  particular,  assess 
ability  fairly.  This  finding  would  be  highly  significant  in  light  of  the 
underprediction  of  women's  performance  reported  in  some  military  training 
schools  (Dunbar  &  Novick,  1985). 

APPENDIX  C:  MULTITEST  EXTENSIONS  OF  OPTIMAL  INDICES 


An  approximation  to  the  likelihoods  required  for  an  optimal  statistic  for 
two  unidimensional  tests  is  given  in  this  appendix.  The  approach  easily 
generalizes  to  m  >  2  dimensions. 

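To  begin,  rewrite  F*  from  Equation  31  as 

\[
\begin{aligned}
F^* &= \iint \left\{ P(U_1 = u_1 \mid \theta_1)\,\frac{\phi(\theta_1)}{\phi(\theta_1)} \right\}
             \left\{ P(U_2 = u_2 \mid \theta_2)\,\frac{\phi(\theta_2)}{\phi(\theta_2)} \right\}
             \phi_2(\boldsymbol{\theta}; \mathbf{0}, \Sigma)\, d\boldsymbol{\theta} \\
    &\approx \iint \left[ \frac{e^{a_1\theta_1^2 + b_1\theta_1 + c_1}}{\phi(\theta_1)} \right]
                   \left[ \frac{e^{a_2\theta_2^2 + b_2\theta_2 + c_2}}{\phi(\theta_2)} \right]
                   \phi_2(\boldsymbol{\theta}; \mathbf{0}, \Sigma)\, d\boldsymbol{\theta} \\
    &= \iint e^{a_1\theta_1^2 + b_1\theta_1 + c_1}\, e^{(1/2)\theta_1^2}\,
             e^{a_2\theta_2^2 + b_2\theta_2 + c_2}\, e^{(1/2)\theta_2^2}\,
             (\det\Sigma)^{-1/2}\,
             e^{(\theta_1^2 - 2\rho\theta_1\theta_2 + \theta_2^2)/[2(\rho^2 - 1)]}\, d\boldsymbol{\theta} \\
    &= F ,
\end{aligned}
\]

where  $\phi(\cdot)$  is  the  standard  normal  density.  For  the  next  step  in  our  analysis, 
it  is  useful  to  rewrite  this  equation  in  matrix  notation. 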

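Consequently,  let 

\[
\boldsymbol{\theta} = (\theta_1, \theta_2)', \qquad
A_1 = \begin{pmatrix} a_1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
A_2 = \begin{pmatrix} 0 & 0 \\ 0 & a_2 \end{pmatrix},
\]
\[
K_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
K_2 = \begin{pmatrix} 0 & 0 \\ 0 & 1/2 \end{pmatrix}, \qquad
K_3 = -\tfrac{1}{2}\Sigma^{-1}
    = \frac{1}{2(\rho^2 - 1)} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix},
\]
\[
\mathbf{b}_1 = (b_1, 0)', \qquad \mathbf{b}_2 = (0, b_2)',
\]

and 

\[
A = A_1 + K_1 + A_2 + K_2 + K_3 , \qquad
\mathbf{b} = \mathbf{b}_1 + \mathbf{b}_2 ,
\]

so  that  the  exponent  of  the  integrand  is 
$\boldsymbol{\theta}'A\boldsymbol{\theta} + \mathbf{b}'\boldsymbol{\theta} + c_1 + c_2$. 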


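To  complete  the  square  in  the  exponent  of  the  above  integrand,  notice 
that  since  A  is  symmetric, 

\[
\boldsymbol{\theta}'A\boldsymbol{\theta} + \mathbf{b}'A^{-1}A\boldsymbol{\theta}
  + \tfrac{1}{4}\mathbf{b}'A^{-1}AA^{-1}\mathbf{b} - \tfrac{1}{4}\mathbf{b}'A^{-1}\mathbf{b}
= (\boldsymbol{\theta} + \tfrac{1}{2}A^{-1}\mathbf{b})' A (\boldsymbol{\theta} + \tfrac{1}{2}A^{-1}\mathbf{b})
  - \tfrac{1}{4}\mathbf{b}'A^{-1}\mathbf{b} ,
\]

provided  that  A  is  negative  definite.  Diagonalize  A  by  $A = V\Lambda V'$, 
where  $V'V = I$,  let  $k = -\tfrac{1}{4}\mathbf{b}'A^{-1}\mathbf{b}$,  and  let 
$w = (\det\Sigma)^{-1/2} e^{c_1 + c_2 + k}$.  Then 

\[
\begin{aligned}
F &= (\det\Sigma)^{-1/2} e^{c_1 + c_2 + k}
     \iint \exp\!\left[ (\boldsymbol{\theta} + \tfrac{1}{2}A^{-1}\mathbf{b})' V\Lambda V'
                        (\boldsymbol{\theta} + \tfrac{1}{2}A^{-1}\mathbf{b}) \right] d\boldsymbol{\theta} \\
  &= w \iint \exp\!\left[ \boldsymbol{\theta}' V\Lambda V' \boldsymbol{\theta} \right] d\boldsymbol{\theta} \\
  &= w \iint \exp\!\left[ \mathbf{t}'\Lambda\mathbf{t} \right] d\mathbf{t} ,
\end{aligned}
\]

where  $\mathbf{t} = (t_1, t_2)' = V'\boldsymbol{\theta}$,  because  the  Jacobian  of  the  transformation  is  one. 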

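The  middle  equality  above  holds  because  the  volume  of  the  bivariate  density  is 
unaffected  by  the  location  parameter.  Since  $\Lambda$  is  diagonal  with 
negative  diagonal  elements  $\lambda_1$  and  $\lambda_2$, 

\[
\begin{aligned}
F &= w \int \exp\!\left[ -\tfrac{1}{2} t_1^2 (-2\lambda_1) \right] dt_1
       \int \exp\!\left[ -\tfrac{1}{2} t_2^2 (-2\lambda_2) \right] dt_2 \\
  &= w \int \exp\!\left[ -\tfrac{1}{2} t_1^2 / \sigma_1^2 \right] dt_1
       \int \exp\!\left[ -\tfrac{1}{2} t_2^2 / \sigma_2^2 \right] dt_2 \\
  &= 2\pi w \sigma_1 \sigma_2 ,
\end{aligned}
\]

where  $-2\lambda_j = 1/\sigma_j^2$,  j  =  1,  2.  Because 
$2\sigma_1\sigma_2 = 1/\sqrt{\lambda_1\lambda_2} = (\det\Lambda)^{-1/2} = (\det A)^{-1/2}$, 
we  obtain 

\[
\begin{aligned}
F &= \pi w (\det A)^{-1/2} \\
  &= \pi \exp\!\left( c_1 + c_2 - \mathbf{b}'A^{-1}\mathbf{b}/4 \right)
     (\det\Sigma)^{-1/2} (\det A)^{-1/2}
\end{aligned}
\]

as  the  final  expression  for  our  approximation  to  F*  given  in  Equation  31. 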
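
As  a  numerical  check  on  this  final  expression,  the  sketch  below  evaluates  the  closed  form  and  compares  it  with  a  brute-force  grid  sum  over  the  same  integrand.  It  rests  on  assumptions  not  confirmed  by  this  appendix:  that  A  is  assembled  from  the  quadratic  coefficients  a_j,  the  1/2  terms  contributed  by  1/phi(theta_j),  and  -(1/2)  times  the  inverse  of  a  correlation  matrix  with  off-diagonal  element  rho.  The  function  names  and  the  coefficient  values  are  hypothetical  and  chosen  only  so  that  A  is  negative  definite.

```python
import math


def closed_form_F(a1, b1, c1, a2, b2, c2, rho):
    """pi * exp(c1 + c2 - b'A^{-1}b / 4) * (det Sigma)^(-1/2) * (det A)^(-1/2).

    A is built under the assumptions stated in the lead-in paragraph.
    """
    k3 = 1.0 / (2.0 * (rho * rho - 1.0))  # scalar factor of -(1/2) * Sigma^{-1}
    A11 = a1 + 0.5 + k3
    A22 = a2 + 0.5 + k3
    A12 = -rho * k3
    det_A = A11 * A22 - A12 * A12
    if not (A11 < 0.0 and det_A > 0.0):
        raise ValueError("A must be negative definite for the integral to converge")
    det_sigma = 1.0 - rho * rho
    # b' A^{-1} b written out for a symmetric 2 x 2 matrix.
    quad = (A22 * b1 * b1 - 2.0 * A12 * b1 * b2 + A11 * b2 * b2) / det_A
    return math.pi * math.exp(c1 + c2 - quad / 4.0) / math.sqrt(det_sigma * det_A)


def grid_F(a1, b1, c1, a2, b2, c2, rho, lo=-8.0, hi=8.0, n=401):
    """Riemann-sum approximation of the same double integral on a square grid."""
    step = (hi - lo) / (n - 1)
    k3 = 1.0 / (2.0 * (rho * rho - 1.0))
    total = 0.0
    for i in range(n):
        t1 = lo + i * step
        for j in range(n):
            t2 = lo + j * step
            exponent = ((a1 + 0.5) * t1 * t1 + b1 * t1 + c1
                        + (a2 + 0.5) * t2 * t2 + b2 * t2 + c2
                        + k3 * (t1 * t1 - 2.0 * rho * t1 * t2 + t2 * t2))
            total += math.exp(exponent)
    return total * step * step / math.sqrt(1.0 - rho * rho)


if __name__ == "__main__":
    # Arbitrary illustrative coefficients, not values from the report.
    args = (-1.0, 0.6, -3.0, -0.8, -0.4, -2.0, 0.5)
    print(closed_form_F(*args), grid_F(*args))
```

The  grid  sum  is  far  slower  than  the  closed  form,  which  is  the  practical  point  of  the  derivation:  the  bivariate  integral  reduces  to  a  few  arithmetic  operations  on  the  entries  of  A  and  b.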


